bheimbu opened this issue 1 year ago
Hi @bheimbu ,
Thank you for your interest in the scripts!
The file gtdb_ver95_alllca_taxid.csv.tar.gz is self-written. But the GTDB team has generated code to create a similar (and even better) version of the same file. You can find it here: https://github.com/shenwei356/gtdb-taxdump

The file all-samples-prokTPM.txt was generated with the SALMON software: I ran SALMON on all files and concatenated the outputs together to generate this file.
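Roughly, that concatenation step could look like this in R (a sketch from memory, not the original script; the directory layout and file names are illustrative):

```r
# Sketch: concatenate per-sample salmon quant.sf files into one table,
# keeping the sample name per row. The "salmon_out" layout is an assumption.
files <- list.files("salmon_out", pattern = "quant.sf$",
                    recursive = TRUE, full.names = TRUE)

tpm <- do.call(rbind, lapply(files, function(f) {
  d <- read.delim(f)                    # Name, Length, EffectiveLength, TPM, NumReads
  d$file_name <- basename(dirname(f))   # per-sample directory name
  d
}))

write.table(tpm, "all-samples-prokTPM.txt",
            sep = "\t", row.names = FALSE, quote = FALSE)
```

Good luck with your analysis! Let me know if you have any other doubts!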
Hi @Jigyasa3 ,
thanks for getting back to me. I'll see how far I can get. Actually, I'm trying to implement your pipeline as a Snakemake workflow to make it more reproducible, so I may have some questions in the future -- just to let you know.
Cheers Bastian
Hi @Jigyasa3,
could you please clarify on:
> The file gtdb_ver95_alllca_taxid.csv.tar.gz is self-written. But the GTDB team has generated code to create a similar (and even better) version of the same file. You can find it here: https://github.com/shenwei356/gtdb-taxdump
I cannot find the mentioned code on that page.
Cheers Bastian
Hi @bheimbu ,
The file gtdb_ver95_alllca_taxid.csv.tar.gz is essentially a taxdump file for a specific version of GTDB. The GitHub page I linked lets you create a taxdump file for any version of the GTDB database. I haven't used it yet; I found it recently and was interested to see that the GTDB team has streamlined the process of using the database for DIAMOND/BLAST analysis.
They give details of the method in their README file. I recommend asking them directly as I haven't used it myself.
Hi @Jigyasa3,
I'm really sorry to bother you, but when I use the code at https://github.com/shenwei356/gtdb-taxdump, I get the following files: delnodes.dmp, merged.dmp, names.dmp, nodes.dmp, and taxid.map. None of these files comes close to your gtdb_ver95_alllca_taxid.csv.tar.gz.
Is there a script or some line of code that you could share with me?
Cheers Bastian
Hi @bheimbu ,
I did a Google search for you. Here are some suggestions.
Hi,
thanks for all your help again. Besides the fact that the links do not work, I'm wondering why you cannot tell me how you created gtdb_ver95_alllca_taxid.csv.tar.gz, since you say:

> The file gtdb_ver95_alllca_taxid.csv.tar.gz is self-written.
Anyway, if you don't want to share this information with me, I have to respect that.
I have some more questions related to your pipeline:

Prokka outputs fna and faa files, but they don't have the same fasta headers, right? So once you use fetchMGs to extract COGs using the Prokka files (faa and fna) as input, you only get COG protein sequences, but no protein-coding nucleotide sequences (see this related post). So how did you do it?
Anyway, I'm a bit confused because the files from this loop

```
while read line; do while read cogs; do cp ${line}/${cogs}*fna allfetchm_nucoutput/${line}-${cogs}.fna; done < allcogs.txt; done < filenames.txt
```

don't appear again anywhere in your pipeline, so are they really important?
I'm really sorry to bother you with all these questions, but I just want to get things right.
Cheers Bastian
Hey @bheimbu ,
Sorry, the only reason I am redirecting you to other resources for creating the gtdb_ver95_alllca_taxid.csv.tar.gz file is that I have already left the university and no longer have access to my university's cluster to check old scripts. BTW, the links do work; you just have to copy and paste them. Somehow clicking on a link redirects you to the issues page of this repository.

From what I remember, I joined the metadata file from GTDB with the taxdump files to create gtdb_ver95_alllca_taxid.csv.tar.gz. It essentially adds an LCA taxonomy to each taxid.
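From memory, the join was roughly like the sketch below (not the original script; the file names and the taxid.map layout are assumptions based on GTDB release 95 conventions):

```r
# Sketch: attach a GTDB lineage to each taxid by joining GTDB metadata
# (accession + gtdb_taxonomy) with a taxid.map (accession -> taxid).
# Both input files here are assumptions, not the original inputs.
meta <- read.delim("bac120_metadata_r95.tsv", stringsAsFactors = FALSE)

taxmap <- read.delim("taxid.map", header = FALSE,
                     col.names = c("accession", "taxid"))

lca <- merge(taxmap, meta[, c("accession", "gtdb_taxonomy")],
             by = "accession")

write.csv(lca[, c("taxid", "gtdb_taxonomy")],
          "gtdb_ver95_alllca_taxid.csv", row.names = FALSE)
```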
Prokka adds _1 to the end of each protein fasta header, so the first part of the header is shared between the protein and nucleotide headers. I just matched the first part. To verify that I was matching the correct nucleotide header, I a) manually compared the annotations of some nucleotide sequences and their corresponding protein sequences, and b) used the emboss online tool to translate some nucleotide sequences to proteins, which should be 100% identical to the original protein sequences.
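In code, that matching is essentially the following (a sketch; the header values are made up for illustration):

```r
# Sketch: strip the trailing "_1" that Prokka appends to protein IDs,
# then match against the nucleotide IDs. Headers are illustrative only.
prot_headers <- c("PROKKA_00010_1", "PROKKA_00011_1")
nuc_headers  <- c("PROKKA_00010", "PROKKA_00011", "PROKKA_00012")

prot_ids <- sub("_1$", "", prot_headers)   # drop the protein suffix
idx      <- match(prot_ids, nuc_headers)   # positions in the nucleotide set

data.frame(protein = prot_headers, nucleotide = nuc_headers[idx])
```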
Yes, you are right, the filenames.txt file created for this while loop is not used again. It was just to keep track of how many files I was working with.
Let me know if you need anything!
Hi,
thanks for the clarification, I didn't know you had left OIST. I'll have a second look at the links you provided.
I'll see what I can do about gtdb_ver95_alllca_taxid.csv.tar.gz.
There are certainly more questions coming, but so far so good ;)
Have a nice weekend,
Bastian
Hi @Jigyasa3,
to be honest, I'm stuck. Right now, I'm trying to combine all my files as in combiningallfiles.md, but I'm failing on the first line.
My salmon quant files look like this:
```
Name                     Length  EffectiveLength  TPM         NumReads
BEC328_contig1:440-1066  627     388.472          75.633197   9.000
BEC328_contig2:250-951   702     463.470          147.919989  21.000
BEC328_contig3:214-567   354     131.388          198.776314  8.000
BEC328_contig6:26-460    435     200.515          227.934551  14.000
BEC328_contig7:281-601   321     106.969          152.595815  5.000
BEC328_contig9:578-1465  888     649.470          60.318628   12.000
BEC328_contig10:7-867    861     622.470          73.424146   14.000
```
So `tpm$fullproteinnames<-paste(tpm$file_name,tpm$gene_name,sep="_")` is not even possible, because there are no columns `file_name` and `gene_name`. Sometimes I really wonder whether we are using the same software versions?

What do your `fullproteinnames` actually look like -- I'm just curious?!
Cheers Bastian
PS: This line also makes me wonder:

`cogs<-read.csv("/bucket/BourguignonU/Jigs_backup/working_files/AIMS/paper1/markergenes/markers-rpkm/individualanalysis_feb2021/allcogs-allsamples-finalkrakenoutput.csv")`

since you mentioned before that `diamond`, not `kraken2`, was actually used.
Hi @bheimbu ,
Yes, I think the software versions are different. But you can still run this code because the data are similar even though the names differ.
To run `tpm$fullproteinnames<-paste(tpm$file_name,tpm$gene_name,sep="_")`, you can add the filename to your file using:

```
awk '{ print FILENAME","$0 }' your_tpm_file_name > your_new_tpm_file_name
```
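Reading the prefixed file back into R could then look roughly like this (a sketch; since quant.sf is tab-delimited, the awk step leaves the filename and the original Name column joined by a comma in the first field):

```r
# Sketch: rebuild file_name/gene_name from the awk output above.
# skip = 1 drops the salmon header row, which also got prefixed.
raw <- read.delim("your_new_tpm_file_name", header = FALSE, skip = 1,
                  col.names = c("name", "Length", "EffectiveLength",
                                "TPM", "NumReads"))

parts         <- strsplit(as.character(raw$name), ",", fixed = TRUE)
raw$file_name <- sapply(parts, `[`, 1)
raw$gene_name <- sapply(parts, `[`, 2)

raw$fullproteinnames <- paste(raw$file_name, raw$gene_name, sep = "_")
```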
Sorry, as I said before, I don't have access to the intermediate files as I am not at OIST anymore. But the final files generated from these scripts are publicly available, if it helps: https://figshare.com/articles/dataset/Tables_for_main_figures/19173407
Thanks for letting me know,
I'll try your suggestions tomorrow. Have a good one,
Bastian
Hi,
it's been a while. I hope you're fine and preparing for the holidays. I have a question:
BLASTp analysis against ANNOTREE database

```
#The "all-wood-gtdb.fasta.dmnd" created by adding protein sequences from the ANNOTREE database corresponding to gene(s) of interest-
diamond blastp --db ${DB_DIR}/all-wood-gtdb.fasta.dmnd --query ${IN_DIR}/${file1} --outfmt 6 --out ${OUT_DIR}/wood-gtdb-matches-${file1}.txt --threads 15
```
Where is `all-wood-gtdb.fasta.dmnd` coming from? I'll try with this [database](https://software-ab.cs.uni-tuebingen.de/download/megan-annotree/welcome.html), and it works, but I cannot relate the results to my kofam results, as this outputs no KEGG IDs, only "gene_id" and "gtdb_id".
Cheers Bastian
Hi,
a different thing: I'd like to publish a Snakemake workflow using some of your scripts (adjusted to my needs). That's why I'd like to ask whether you want to be a co-author? Let me know your decision.
If not, I'll clearly state that your code was used extensively.
Cheers Bastian
Hi @bheimbu ,
Thanks for the message! Sorry, I was very busy during and after the holidays! The Annotree KEGG IDs and sequences come from here: http://annotree.uwaterloo.ca/annotree/app/. If you search for a KEGG ID of interest, Annotree has the option to download a CSV file that contains the KEGG ID, protein sequence, bacterial ID, etc. You can then extract the protein sequences and KEGG IDs in fasta format and use that as a database for diamond blastp.
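That conversion could look roughly like this (a sketch; the CSV column names are assumptions about the Annotree export, so check them against your download):

```r
# Sketch: turn an Annotree CSV download into a FASTA file usable as a
# diamond blastp database, keeping the KEGG ID in the header so hits
# can be mapped back to KOs later. Column names are assumptions.
hits <- read.csv("annotree_hits.csv", stringsAsFactors = FALSE)

headers <- paste0(">", hits$geneId, "|", hits$keggId, "|", hits$gtdbId)
writeLines(as.vector(rbind(headers, hits$sequence)), "all-wood-gtdb.fasta")

# Then build the database in the shell:
#   diamond makedb --in all-wood-gtdb.fasta --db all-wood-gtdb.fasta
```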
I am super excited that you are ready to publish the Snakemake workflow! I would ask that you add my name to any code used verbatim from my GitHub repository; otherwise, you are welcome to acknowledge my name. Could you please also add the name of the paper associated with my code for reference?
Good luck!
Hi,
I'm happy that you are on board. Can you give me your address details and ORCID ID (via email at bheimbu@gwdg.de)?
Of course, I will reference you. I'm preparing the manuscript right now and would be happy if you would provide some comments and feedback once it is finished.
Cheers Bastian
Hi,
I'm wondering where gtdb_ver95_alllca_taxid.csv.tar.gz is coming from? Did you write it yourself or download it from somewhere? I'm using your pipeline to analyze the microbiome of Australian termites, but I want to use GTDB ver202 or later.

Additionally, I'd like to know where the gtf file (/bucket/BourguignonU/Jigs_backup/working_files/AIMS/AIM2/tpm_functional_annotation/functional_annotation/all_functions_all_taxonomy/gtf_files_Dec2019/named-gtffiles/filename-230-13-prokka.map.gtf), found in hpc_tpmcal.md, is coming from. Have you used this code snippet to create it?

Also, it is not clear to me where this file (/bucket/BourguignonU/Jigs_backup/working_files/AIMS/paper1/markergenes/markers-rpkm/individualanalysis_feb2021/all-samples-prokTPM.txt) from here is coming from.

Cheers Bastian