oist / EGU-The-functional-evolution-of-termite-gut-microbiota

This repository contains the scripts used to run the analyses performed for the manuscript "The functional evolution of termite gut microbiota".

gtdb_ver95_alllca_taxid.csv.tar.gz #1

Open bheimbu opened 1 year ago

bheimbu commented 1 year ago

Hi,

I'm wondering where gtdb_ver95_alllca_taxid.csv.tar.gz comes from. Did you write it yourself or download it from somewhere? I'm using your pipeline to analyze the microbiome of Australian termites, but I want to use GTDB ver202 or later.

Additionally, I'd like to know where the gtf file (/bucket/BourguignonU/Jigs_backup/working_files/AIMS/AIM2/tpm_functional_annotation/functional_annotation/all_functions_all_taxonomy/gtf_files_Dec2019/named-gtffiles/filename-230-13-prokka.map.gtf, found in hpc_tpmcal.md) comes from. Did you use this code snippet to create it?

Also, it is not clear to me where this file (/bucket/BourguignonU/Jigs_backup/working_files/AIMS/paper1/markergenes/markers-rpkm/individualanalysis_feb2021/all-samples-prokTPM.txt), referenced here, comes from.

Cheers Bastian

Jigyasa3 commented 1 year ago

Hi @bheimbu ,

Thank you for your interest in the scripts!

  1. The file gtdb_ver95_alllca_taxid.csv.tar.gz is self-written, but there is now code available to create a similar (and even better) version of the same file. You can find it here: https://github.com/shenwei356/gtdb-taxdump
  2. The gtf files were generated from PROKKA. You found the right code snippet to convert.
  3. The file all-samples-prokTPM.txt was generated with the SALMON software: I ran SALMON on every sample and concatenated the resulting tables into this file. A sketch of the idea is shown below.
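
Schematically, the SALMON step looked roughly like this (the index name, read layout, and paths here are illustrative, not the original commands):

    # Quantify each sample against a Prokka-derived index with salmon;
    # each run writes its own quant.sf table into salmon_out/<sample>/.
    for r1 in reads/*_R1.fastq.gz; do
        name=$(basename "${r1}" _R1.fastq.gz)
        salmon quant -i prokka_index -l A \
            -1 "${r1}" -2 "reads/${name}_R2.fastq.gz" \
            -p 8 -o "salmon_out/${name}"
    done

The per-sample quant.sf tables were then concatenated into all-samples-prokTPM.txt.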

Good luck with your analysis! Let me know if you have any other doubts!

bheimbu commented 1 year ago

Hi @Jigyasa3 ,

thanks for getting back to me. I'll see how far I can get. Actually, I'm trying to implement your pipeline in a Snakemake workflow to make it more reproducible, so I may have more questions in the future -- just to let you know.

Cheers Bastian

bheimbu commented 1 year ago

Hi @Jigyasa3,

could you please clarify on:

The file gtdb_ver95_alllca_taxid.csv.tar.gz is self-written, but there is now code available to create a similar (and even better) version of the same file. You can find it here: https://github.com/shenwei356/gtdb-taxdump

I cannot find the mentioned code on that page.

Cheers Bastian

Jigyasa3 commented 1 year ago

Hi @bheimbu ,

The file gtdb_ver95_alllca_taxid.csv.tar.gz is essentially a taxdump-style table for a specific version of GTDB. The GitHub page I linked lets you create taxdump files for any version of the GTDB database. I haven't used it yet; I found it recently and was glad to see that the process of using the database for DIAMOND/BLAST analysis has been streamlined.

They give details of the method in their README file. I recommend asking the authors directly, as I haven't used it myself.
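
For example (an untested sketch -- I haven't run it): gtdb-taxdump writes a standard taxdump (names.dmp, nodes.dmp, taxid.map) into an output directory, and taxonkit, by the same author, can expand each taxid into its full lineage, which would approximate my CSV:

    # Paths are illustrative; per the gtdb-taxdump README, taxid.map is
    # "accession<TAB>taxid", and taxonkit reads taxids from stdin.
    cut -f 2 gtdb-taxdump/R95/taxid.map | sort -u \
        | taxonkit lineage --data-dir gtdb-taxdump/R95 \
        | awk -F '\t' '{ print $1 "," $2 }' > gtdb_ver95_alllca_taxid.csv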

bheimbu commented 1 year ago

Hi @Jigyasa3,

I'm really sorry to bother you, but when I use the code at https://github.com/shenwei356/gtdb-taxdump, I get the following files: delnodes.dmp, merged.dmp, names.dmp, nodes.dmp, and taxid.map. None of these comes close to your gtdb_ver95_alllca_taxid.csv.tar.gz.

Is there a script or some line of code that you could share with me?

Cheers Bastian

Jigyasa3 commented 1 year ago

Hi @bheimbu ,

I did a Google search for you. Here are some suggestions.

  1. To incorporate the taxdump files from GTDB into DIAMOND, check this link: https://www.biostars.org/p/412823/ and the DIAMOND manual: https://gensoft.pasteur.fr/docs/diamond/2.0.4/3_Command_line_options.html
  2. For the GTDB equivalent of the files required as input to DIAMOND, see https://github.com/shenwei356/gtdb-taxdump/issues/6 (and the sketch after this list).
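
Per the DIAMOND manual, the taxonomy files are supplied when the database is built. Illustratively (I have not run this against GTDB myself, and the input fasta name is made up):

    # DIAMOND accepts NCBI-style taxdump files at makedb time; the
    # gtdb-taxdump outputs are intended as drop-in replacements, though
    # taxid.map first has to be reshaped into the prot.accession2taxid
    # layout that --taxonmap expects.
    diamond makedb --in gtdb_proteins.faa --db gtdb_proteins \
        --taxonmap prot.accession2taxid.gz \
        --taxonnodes nodes.dmp \
        --taxonnames names.dmp

A database built this way lets diamond blastp emit taxonomy-aware output (e.g. --outfmt 102 for LCA classification).
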
bheimbu commented 1 year ago

Hi,

thanks for all your help again. Aside from the fact that the links do not work, I'm wondering why you cannot tell me how you created gtdb_ver95_alllca_taxid.csv.tar.gz, since you say

The file gtdb_ver95_alllca_taxid.csv.tar.gz is self-written.

Anyway, if you don't want to share this information with me, I have to respect that.

I would have some more questions related to your pipeline:

  1. Prokka outputs fna and faa files, but they don't have the same fasta headers, right? So once you use fetchMGs to extract COGs, with the Prokka files (faa and fna) as input, you only get COG protein sequences but no protein-coding nucleotide sequences (see this related post). So how did you do it?

  2. Anyway, I'm a bit confused because the files produced by

    while read line; do
        while read cogs; do
            cp ${line}/${cogs}*fna allfetchm_nucoutput/${line}-${cogs}.fna
        done < allcogs.txt
    done < filesnames.txt

don't appear anywhere else in your pipeline -- are they actually important?

I'm really sorry to bother you with all these questions, but I just want to get things right.

Cheers Bastian

Jigyasa3 commented 1 year ago

Hey @bheimbu ,

  1. Sorry -- the only reason I am redirecting you to other resources for creating the gtdb_ver95_alllca_taxid.csv.tar.gz file is that I have left the university and no longer have access to my university's cluster to check the old scripts. From what I remember, I joined the metadata file from GTDB with the taxdump files to create gtdb_ver95_alllca_taxid.csv.tar.gz; it essentially adds an LCA taxonomy to each taxid. By the way, the links do work, but you will have to copy and paste them -- somehow, clicking on them redirects you to the issues page of this repository.

  2. Prokka appends _1 (more generally, _<n>) to the end of each protein fasta header, so the first part of the header is shared between the protein and nucleotide headers, and I simply matched on that first part (see the sketch after this list). To verify that I was matching the correct nucleotide header, I a) manually compared the annotations of some nucleotide sequences and their corresponding protein sequences, and b) translated some nucleotide sequences with the EMBOSS online tool, which should yield proteins 100% identical to the original protein sequences.

  3. Yes, you are right -- the filenames.txt file created from this while loop is not used again. It was just there to keep track of how many files I was working with.
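
Roughly, the matching could be reconstructed like this (I no longer have the original script, so the file names and the seqkit step are illustrative):

    # Take the protein IDs that fetchMGs reported for one COG, strip the
    # trailing _<n> suffix from each protein header, and pull the matching
    # nucleotide records out of the .fna by ID.
    grep '^>' COG0012.faa | cut -d ' ' -f 1 | sed 's/^>//; s/_[0-9]*$//' > cog_ids.txt
    seqkit grep -f cog_ids.txt sample.fna > COG0012.fna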

Let me know if you need anything!

bheimbu commented 1 year ago

Hi,

thanks for the clarification -- I didn't know you had left OIST. I'll have a second look at the links you provided.

I'll see what I can do about gtdb_ver95_alllca_taxid.csv.tar.gz.

There are certainly more questions coming, but so far so good ;)

Have a nice weekend,

Bastian

bheimbu commented 1 year ago

Hi @Jigyasa3,

to be honest, I'm stuck. Right now I'm trying to combine all my files as in combiningallfiles.md, but I'm failing at the first line.

My salmon quant files look like this:

Name    Length  EffectiveLength TPM NumReads
BEC328_contig1:440-1066 627 388.472 75.633197   9.000
BEC328_contig2:250-951  702 463.470 147.919989  21.000
BEC328_contig3:214-567  354 131.388 198.776314  8.000
BEC328_contig6:26-460   435 200.515 227.934551  14.000
BEC328_contig7:281-601  321 106.969 152.595815  5.000
BEC328_contig9:578-1465 888 649.470 60.318628   12.000
BEC328_contig10:7-867   861 622.470 73.424146   14.000

So tpm$fullproteinnames<-paste(tpm$file_name,tpm$gene_name,sep="_") is not even possible, because there are no file_name and gene_name columns. Sometimes I really wonder whether we're using the same software versions.

What do your fullproteinnames actually look like? I'm just curious!

Cheers Bastian

PS: This file also makes me wonder: cogs<-read.csv("/bucket/BourguignonU/Jigs_backup/working_files/AIMS/paper1/markergenes/markers-rpkm/individualanalysis_feb2021/allcogs-allsamples-finalkrakenoutput.csv") -- you mentioned before that DIAMOND, not kraken2, was actually used.

Jigyasa3 commented 1 year ago

Hi @bheimbu ,

Yes, I think the software versions are different, but you can still run this code because the data are similar even though the column names differ. To run tpm$fullproteinnames<-paste(tpm$file_name,tpm$gene_name,sep="_"), you can add the filename to your file using:

  1. awk command: awk '{ print FILENAME "," $0 }' your_tpm_file_name > your_new_tpm_file_name
  2. in R: the first column will now be the filename (i.e. the "file_name" column of the R script), and your "Name" column is already the "gene_name" column, so you can combine the two (see the sketch after this list).
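
Applied across all samples at once, that becomes something like the following (paths are illustrative; FNR > 1 skips each file's header line, and the directory part of FILENAME is what distinguishes the samples):

    # Prefix every data row with its source file and write one combined
    # table, which R can then read in as the "tpm" data frame.
    awk 'FNR > 1 { print FILENAME "," $0 }' salmon_out/*/quant.sf > all-samples-tpm.csv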

Sorry -- as I said before, I don't have access to the intermediate files since I am no longer at OIST. But the final files generated from these scripts are publicly available, if that helps: https://figshare.com/articles/dataset/Tables_for_main_figures/19173407

bheimbu commented 1 year ago

Thanks for letting me know,

I will try your suggestions tomorrow. Have a good one,

Bastian

bheimbu commented 11 months ago

Hi,

it's been a while. I hope you're well and preparing for the holidays. I have a question:

BLASTp analysis against the ANNOTREE database:

    # The "all-wood-gtdb.fasta.dmnd" was created by adding protein sequences
    # from the ANNOTREE database corresponding to the gene(s) of interest
    diamond blastp --db ${DB_DIR}/all-wood-gtdb.fasta.dmnd --query ${IN_DIR}/${file1} --outfmt 6 --out ${OUT_DIR}/wood-gtdb-matches-${file1}.txt --threads 15

Where is the `all-wood-gtdb.fasta.dmnd` coming from? I tried this [database](https://software-ab.cs.uni-tuebingen.de/download/megan-annotree/welcome.html); it works, but I cannot relate the results to my kofam results, as it outputs no KEGG IDs, only "gene_id" and "gtdb_id".

Cheers Bastian

bheimbu commented 10 months ago

Hi,

a different thing: I'd like to publish a Snakemake workflow using some of your scripts (adjusted to my needs). That's why I'd like to ask whether you would want to be a co-author. Let me know your decision.

If not, I'll clearly state that your code was used extensively.

Cheers Bastian

Jigyasa3 commented 10 months ago

Hi @bheimbu ,

Thanks for the message! Sorry, I was very busy during and after the holidays. The AnnoTree KEGG IDs and sequences come from here: http://annotree.uwaterloo.ca/annotree/app/. If you search for a KEGG ID of interest, AnnoTree has the option to download a CSV file that contains the KEGG ID, protein sequence, bacterial ID, etc. You can then extract the protein sequences and KEGG IDs in fasta format and use that as a database for diamond blastp; schematically, see below.
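
Schematically (the CSV column positions are illustrative -- check the header of the file AnnoTree actually exports, and note that a naive awk split breaks if any field contains a comma):

    # Turn the AnnoTree CSV into a fasta whose headers keep the KEGG ID, so
    # diamond blastp hits can be related back to kofam annotations, then
    # build the DIAMOND database (.dmnd is appended automatically).
    awk -F ',' 'NR > 1 { print ">" $1 "_" $2 "\n" $3 }' annotree_K00001.csv > all-wood-gtdb.fasta
    diamond makedb --in all-wood-gtdb.fasta --db all-wood-gtdb.fasta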

I am super excited that you are ready to publish the Snakemake workflow! I would ask that my name be added to any code used verbatim from my GitHub repository; otherwise, you are welcome to acknowledge me by name. Could you please also cite the paper associated with my code for reference?

Good luck!

bheimbu commented 10 months ago

Hi,

I'm happy that you are on board. Can you send me your address details and ORCID iD (via email at bheimbu@gwdg.de)?

Of course I will reference you. I'm preparing the manuscript right now and would be happy if you could provide comments and feedback once it is finished.

Cheers Bastian