jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0
363 stars 78 forks source link

Regarding make_databases.pl and taxonomic annotations #172

Closed ShailNair closed 3 years ago

ShailNair commented 3 years ago

HI, I am trying to make the database from source using the make_databses.pl. But, I am confused about why there is a need to download the preformatted database file as shown in the make _databses.pl code:

Download general db tarball. (-U '' so that we give the server an user agent string, it complains otherwise)

print "Downloading and unpacking general database tarball...\n"; system("wget -U '' -P $download_dir http://wwwuser.cnb.csic.es/~squeezem/db.tar.gz; tar -xvzf $download_dir/db.tar.gz -C $download_dir; rm $download_dir/db.tar.gz");

Or is it just for reference to create the new database?

fpusan commented 3 years ago

It is a tarball with several small files that do not change, so we redistribute them on one go. In a further step NCBI's nr, eggNOG4 and PFAM downloaded from source and compiled into the DIAMOND format.

ShailNair commented 3 years ago

Got it. Recently I ran squeezemeta pipeline with default settings (with -Doublepass option). Unfortunately, at lower taxonomic levels (like order, genus) a lot of reads are unclassified. Even the ones which are most abundant in my experiment (My experiment consists of an artificial bacterial community, wherein we selected some bacterial genera and co-cultured together). I ran a preliminary taxonomic analysis using kraken2, kaiju, and than sunbeam metagenomic pipeline (which in fact uses the same kraken2 for classification) All of this gave expected results (but with a few false positives)as was confirmed from 16s amplicon sequencing, and agar plating. I want to go with the squeezemeta pipeline as first of all it didn't give me any false positives (though the problem of not classifying remains), secondly, it is a complete pipeline from raw reads to final visualization (easy to run on many samples)and includes many extra important features,

So, I thought maybe I can build the database from source and try classifying again. Do you think that might be the problem?

fpusan commented 3 years ago

No, it's not the database's fault.

Rather, SqueezeMeta is very conservative when assigning taxonomies. A brief explanation of our algorithm can be found in #79, a more elaborate one in the PDF manual.

Note that SqueezeMeta is designed with environmental metagenomes in mind, in which the genomes in the community can and will differ from the reference genomes present in your databases. In that scenario, we've found that solutions like Kraken2 might make overpredictions (https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-6289-6). In our opinion, this leads to a false sense of confidence. Users will see genus-level classifications and naturally use them as a basis for writing their papers, even if those classifications are not really that accurate. We've chosen to go the opposite way: SqueezeMeta will only report a taxonomy if we're quite sure that it is correct (Last Common Ancestor + Identity cutoffs).

This said, since you are working with a co-culture of (I assume) well characterized genera, the problems I mentioned above would likely not happen for you. I will make it so that the annotations without identity cutoffs can optionally be loaded into SQMtools. This will make it into the next version, if you want to have the feature beforehand let me know and I'll instruct you on the (minor) modifications required to get it.

ShailNair commented 3 years ago

Yes. That will be great if I can try without cutoff and if that's going to refine unclassified taxa. My whole hypothesis (regarding my experiment) is based on a 3 dominant genera within the sample. From other analysis like 16s rRNA gene, agar plating) I get expected results (as majority reads, colonies taken by these 3 genera). from Kraken also the result is quite similar, though with some false classifications. Squeezemeta works nicely till class level, but down the taxa most reads are unclassified. As you said, squeezemeta is more stringent (while being more accurate), might be the reason.

Attched are the images from 16S, kraken2 and squeezemeta classification (same sample) 16s k2 sqz

fpusan commented 3 years ago

Ok, let's do this. First open the sqm2tables.py script (you can locate within the conda environment with which sqm2tables.py. Below line 157 (assuming you have v1.2) add the following two lines:

tax_abunds_orfs = aggregate_tax_abunds(orfs['abundances'], orf_tax_nofilter_wranks, idx)
write_row_dict(sampleNames, tax_abunds_orfs, prefix + '{}.nofilter.abund.tsv'.format(rank))

Make sure to respect the same indentation as in L157 (with spaces, not tabs!).

Then re-run sqm2tables.py in your project. The output folder should now contain an extra set of tables (with the nofilter suffix) containing the unfiltered taxonomy for the different ranks. Check whether the genus-level taxonomy looks more like you would expect. If things look nice, we can then work on how to load that into SQMtools.

ShailNair commented 3 years ago

Yes, the modification assigned the unclassified_Alphaproteobacteria and most of the others (Though Unclassified Hyphomonadaceae, unclassified_Rhodobacterales remain unchanged). But, now the false positives also showed up (which was not present earlier). These are some closely related genera of autotrophic cyanobacteria. My experiment consists of only one cyanobacterial sps. (As the initial experiment was started with a pure cyanobacterial sps. inoculated with a group of heterotrophic bacteria. Time to time we have checked for contamination via microscopy, flowcytometry, 16S sequencing, and plating).

I guess it will be difficult to get the exact results from the metagenome as the preliminary tests via kraken2, kaiju and metagenomic analysis (sunbeam pipeline, manual analysis) too showed up the same problem.

fpusan commented 3 years ago

I'm not surprised. The taxonomic resolution of shotgun metagenomics is far inferior to that of 16S (or other marker gene) approaches.

If you want to load the new results in SQMtools you can try the following: 1) Edit the file in /path/to/SqueezeMeta/lib/SQMtools/R/loadSQM.R 2) In line 117, add 'nofilter' to the vector of allowed tax_modes. 3) Reinstall SQMtools with R CMD INSTALL /path/to/SqueezeMeta/lib/SQMtools (make sure to have the SqueezeMeta environment activated!). 3) You can now load your project with proj = loadSQM('/path/to/project', tax_mode='nofilter')

ShailNair commented 3 years ago

Yes, indeed. what's more astonishing is the microbes in these samples aren't as diverse as in natural habitat. Still, the outcome is like this. Anyways, I will run squeezemeta with default filters on my other samples. Will see how the results come out. Then can choose whether to go with the default settings or with the above modification.

Another question is, From the taxonomic tables, I saw that squeezemeta also able to classify other kingdoms like eukaryotes, viruses. My question is how reliable efficient squeezemeta is with regards to no-bacterial reads. Can it be used for non-bacterial samples?

From my samples The major portion was classified as bacteria, some as eukaryotes and some viruses. None of the reads were classified as archea.

  original_1 original_2
k_Bacteria 70595482 70258642
k_Eukaryota 73873 68946
k_Unclassified 170204 174716
k_Viruses 425 456

What are these numbers? are they number of reads classified into each taxon?

fpusan commented 3 years ago

Yes, it is the number of reads classified into it taxon.

Regarding other taxa, in our experience eukaryotes tend to be problematic. They are not as well represented in the databases as prokaryotic taxa, so on average hit identities are lower, which leads to poorer taxonomic classifications (again, see #79).

the microbes in these samples aren't as diverse as in natural habitat

Now this is a bit tricky. First, it is not about diversity, but rather about whether the taxa in your samples are well represented in the database. Secondly, SqueezeMeta (as well as other tools, such as MEGAN) uses a Last Common Ancestor algorithm to assign taxonomies. This means that, even if the genomes in your sample are well represented in the database, you might get unresolved taxonomies: e.g. if a protein is conserved at the family level it will have hits against different genera from that family, so taxonomy might not reach genus-level resolution.

Rather than discussing phylotypes (e.g. "my data has three dominant genera") it might be useful to discuss genomes/bins ("my data has three dominant genomes). SqueezeMeta also provides binning results, and ways to explore those using R and anvi'o. Binning comes with its own set of issues and caveats, but since your community is not complex, you might be able to curate your bins manually, which IMO is the best scenario when using binning.

EDIT: as a bonus, once you've delineated a bin its easy to determine its taxonomy by looking at its 16S/other markers.

ShailNair commented 3 years ago

Yup. That is correct. From my knowledge, (very amateur at genomics) the constructed bins will be very few when compared with the whole metagenome. So, it will be not feasible to go for taxonomy using bins (though fewer but the assigned taxonomy will be accurate). Yes, bins are very important to find more detail about representative bacteria like investigating even at strain level and to find functional genes within that bacteria.

ShailNair commented 3 years ago

Hi, Well, i have finally finished running all the samples via squeezemeta. All over the results are good with what we had expected, except, that the taxonomic classification problem remains wherein with filter many reads are unclassified at lower taxon levels e.g. at family, genus level. But the good thing is, it involves almost null false positives. With the "no filter" option I can get some of these unclassified reads classified into respective taxa but the results are still not satisfactory as compared to other classifiers (kraken2, kaiju). But in comparison to these other tested classifiers squeezemeta gave (with no filters) very less false classification at genus level (with prokfilter, false classification was almost null).

I am not writing these to criticize anything or anyone, just want to mention so that it might help in refining the future versions (if at all some other people too experience a similar problem). As, you have explained above and with my little knowledge, I am aware of the complexities it involves and how every sample is complex and different. Still, i am happy to use squeezemeta for its simplicity, automation, and easy to use results and will continue to implement squeezemeta in future analysis.

Secondly, is it possible to retrieve a particular sequence which was used by diamond to classify taxonomy?

Thanks to the whole Squeezemeta team for the good work.

fpusan commented 3 years ago

Regarding the annotation, another option would be to modify the hardcoded parameters in /path/to/SqueezeMeta/scripts/parameters.pl (lines 12 to 16), which affect how the LCA algorithm is run. They are set to sensible values by default, but we haven't really experimented a lot on what happens when you change them. Perhaps @jtamames can guide you on that part.

Secondly, is it possible to retrieve a particular sequence which was used by diamond to classify taxonomy?

Sure. DIAMOND results are stored in /path/to/Project/intermediate/04.*.nr.diamond. Just grep your ORF against that file and you should get the NCBI IDs of all the hits.

ShailNair commented 3 years ago

thanks. parameters.pl have these parameters set by default. perhaps, @jtamames might explain to me more about it.

$scoreratio6=0.8; #-- STEP6/8: Ratio first score/currsent score for the hit to be considered $diffiden6=10; #-- STEP6/8: Maximim identity difference with the first $flex6=0.2; #-- STEP6/8: Allows this PERCENTAGE (if less than one) or NUMBER (if greater than one) of hits from different taxa than LCA $minhits6=2; #-- STEP6/8: Minimum number of hits for the taxa (if there is only one valid hit, this value sets to one automatically $noidfilter6=1; #-- STEP6/8: Set to 1, creates a new set of results with no identity filters

Sorry for all the trouble.

jtamames commented 3 years ago

Hello Some details about this algorithm (and others) at given in the SqueezeMeta manual. It says:

We use a Last Common Ancestor (LCA) algorithm to assign taxa to genes. For the amino acid sequence of each gene, DIAMOND (blastp) homology searches are done against the GenBank nr database (updated weekly). A e-value cutoff of 1e-03 is set by default. The best hit is obtained, and then we select a range of hits (valid hits) having at least 80% of the bitscore of the best hit and differing in less than 10% identity also with the best hit (these values can be set). The LCA of all these hits is obtained, that is, the taxon common to all hits. This LCA can be found at diverse taxonomic ranks (from phylum to species). We allow some flexibility in the definition of LCA: a small number of hits belonging to other taxa than the LCA can be allowed. In this way we deal with putative transfer events, or incorrect annotations in the database. This value is by default 10% of the total number of valid hits, but can be set by the user. Also, the minimum number of hits to the LCA taxa can be set.

So these values correspond to $scoreratio6 (80%) and $diffiden6 (10%). $flex6 corresponds to that mentioned flexibility: it's the percentage of hits that can belong to a different taxa than the consensus (but it is set to 20%, not 10% as the manual says). Besides, $minhits6 sets the minimum number of hits.

As Fernando says, we have not played enough with these values as to have a clear idea on the effect of changing them.

Best, Javier

ShailNair commented 3 years ago

Thanks @jtamames , i will try to see if I can fiddle around it. @fpusan i have applied nofilter for multiple projects using- tax_abunds_orfs = aggregate_tax_abunds(orfs['abundances'], orf_tax_nofilter_wranks, idx) write_row_dict(sampleNames, tax_abunds_orfs, prefix + '{}.nofilter.abund.tsv'.format(rank))

How to parse all the nofilter tables into one in combine-sqm-tables. I tried to put an extra set of command below line 181 in combine-sqm-tables.py with

parse_table('{}/results/tables/{}.phylum.nofilter.abund.tsv'.format(projPath, projName), all_phylum) parse_table('{}/results/tables/{}.class.nofilter.abund.tsv'.format(projPath, projName), all_class) ................................................

got error- Error where parsing table in "{}". Sample "{}" appears more than once in your input tables.'.format(path, sample))

May be something else needs to be added.

ShailNair commented 3 years ago

@fpusan Hi, When loading a single project into sqmtools ( with nofilter added to loadSQM.R) I get following error:

OB = loadSQM('/home/ahail/Documents/squeezemeta_analysis/original_bottle', tax_mode = "nofilter") Loading orfs table... abundances... sequences taxonomy... Loading contigs table... abundances... sequences... taxonomy... binning info... Loading bins table... abundances... taxonomy... Loading taxonomies Loading functions cannot open file '/home/ahail/Documents/squeezemeta_analysis/original_bottle/results/tables/original_bottle.KO.cov.tsv': No such file or directoryError in file(file, "rt") : cannot open the connection

On manual checkup, I did not find original_bottle.KO.cov.tsv in the tables directory generated by sqm2tables.py.

I had created run the project till sqm2tables.py on office server and then moved the project directory (without raw files, data ,temp) to my personal pc wherein I installed the SQMtools R package (with said modification for NOFILTER)

You had mentioned " (make sure to have the SqueezeMeta environment activated!). does it mean I need to install squeezemeta conda environment on my pc too, to work with sqmtools?".

fpusan commented 3 years ago

Hi,

Regarding your last error, it is because you currently have some kind of hybrid between v1.2 and our develop branch. You need to download the sqm2tables.py script from the develop branch and re-run it in your project.

Regarding the issue with combine-sqm-tables, I'll add support for the nofilter taxonomy in the next version. For now you can get the same result by loading each project individually in SQMtools as we discussed, and combining them with the combineSQMlite function.

ShailNair commented 3 years ago

Yes, that is spot on. i had installed sqmtools dev version. thanks. i will re-run sqm2tables.py with dev version

fpusan commented 3 years ago

combine-sqm-tables.py and loadSQMlite now support the use of the nofilter taxonomy as of 40dbd12. I'd still recommend to load each project individually using loadSQM and then combine them with combineSQMlite, as it gives you the option to subsetting your data before combining it.

ShailNair commented 3 years ago

is it possible to install it via conda. On server, I do not have sudo right (unfortunately some of the third party packages are not installed on the server.). If not, that's fine... will try when the new stable conda package is released. Thanks

fpusan commented 3 years ago

I just uploaded a developement version to conda with the latest updates. Get it with

conda create -n SqueezeMeta -c conda-forge -c bioconda -c fpusan squeezemeta-dev

ShailNair commented 3 years ago

Thanks a lot for the quick update. Does this version involves any other modifications, do I need to run the project again or the results will be same?

i installed the dev version using above command. In the conda installed packages it says version- squeezemeta-dev-1.3beta2-pl526r36_0

But the SqueezeMeta.pl -v output shows 1.2.0

When I tried to run sqm2tables.py ( with existing project- created with stable 1.2.0) it gave me an error:

Traceback (most recent call last): File "/home/mcs/miniconda3/envs/squeezemeta/bin/sqm2tables.py", line 195, in main(parse_args()) File "/home/mcs/miniconda3/envs/squeezemeta/bin/sqm2tables.py", line 58, in main total_reads, total_bases = parse_mappingstat(perlVars['$mappingstat']) KeyError: '$mappingstat'

seems some problem while parsing mappingstat file. I tried with 3 different existing projects, all showed the same error

ShailNair commented 3 years ago

@fpusan any solution?

fpusan commented 3 years ago

Didn't notice your last post.

Sure, just edit the SqueezeMeta_conf.pl file inside each of your project's directories and add the following below L55.

$mappingstat = "$resultpath/10.$projectname.mappingstat"; #-- From mapsamples.pl, mapping statistics for all samples

Newly created projects will not need this.

ShailNair commented 3 years ago

yup.that worked. thanks

ShailNair commented 3 years ago

@fpusan @jtamames Hi, This question might be out of the coverage of squeezemeta, but still thought you might give some suggestions. In my metagenomic analysis, I am basically interested in nitrogen metabolism and related taxa. From our other experiments (isolation of nitrogen potential nitrogen-fixing bacteria via special nitrogen-free media, Acetylene reduction assay, NifH gene amplification via PCR) we strongly think that the sample should have nitrogen-fixing microbes. After running squeezemeta and exploring the taxonomic and functional profiles, I got most of the genes of interest. But for nif-H gene, the functional annotation has confusing results.

According to COG and PFAM database, a portion of the annotated genes are related to Nitrogenase subunit NifH (ATPase) while kegg classifies the same protein sequence as chlorophyllide a reductase subunit X and light-independent protochlorophyllide reductase subunit L.

capture-20201028-165513

I manually checked the protein sequence via NCBI BLASTP and the sequence is closely related to both the genes. From literature- "light-independent protochlorophyllide reductase complex share a surprisingly high degree of protein sequence similarity with the nitrogenase subunits encoded by nifHDK".

So, my problem is when I subset taxa related to this particular gene, the resultant table also include other taxa having chlorophyllide a reductase subunit X gene. For eg., it shows rhizobium (with niH gene) and other prokaryotes ( having protochlorophyllide reductase subunit L). Which impossible to differentiate.

It will be helpful if you can give any suggestions if i can get the targeted taxa from squeezemeta. Similarly, As I understand, squeezemeta uses ORF (protein) sequence to assign taxonomy and function. Is there by any chance can I get the nucleotide sequence (not protein sequence) of the annotated function?

jtamames commented 3 years ago

Hello Yes, I myself have worked with these genes in the past and they are difficult to distinguish. Unfortunately a simple homology search ad SqueezeMeta does is not enough to differentiate both. That´s why we emphasize that SqueezeMeta results are the starting point of the analysis. You have still to curate the results focusing on analyzing carefully the genes you are interested in. Same thing happens, for instance, with bins. Once you are done, you can add your manual curation to SqueezeMetra tables.

I don´t remember if these very similar activites belong to different COG or KEGG ids. You can check the annotations of functions in table 13. If you see that there is a " " after the COG/KEGG id, it means that the annotation is supported by the best hit and the best average scores (see the manual for details on this). If the " " is missing, only best hit supports the annotation, meaning that you must be extra careful because it is an indication of the presence of two intertangled families.

The nucleotide sequeces of ORFs are stored in the 03.projectname.fna file, you can have them from there. But also using sqmtools can help you on that, rigth @fpusan?

Best, Javier

ShailNair commented 3 years ago

Thanks. Both the KEGG and COG annotation for this particular gene have the " * ". currently, I am focusing on manual curation. Also, I am planning to amplify the target DNA region using primers followed by amplicon sequencing. (I did amplify nifH in some of the isolated bacteria from the same experiment previously and NT blast confirmed the nifH sequence). BTW, how can I add manual curation to squuezemeta tables? Do you mean by, once I manually confirm a gene, I change the annotation in the squeezemeta table?

fpusan commented 3 years ago

Actually, SQMtools does not store nucleotide sequences for individual genes, only for the complete contigs. They can still be found in the files Javier mentioned, or you can just get the contig sequence in SQMtools and use a bit of R to select the appropriate region.

In order to add manual curation to squeezemeta results, I guess the safest way would be to manually modify table 13, and then repeat sqm2tables.py. If you are interested in a particular function or set of functions (e.g. NifH), you could also build a custom database and pass it to SqueezeMeta with the --extdb argument.

ShailNair commented 3 years ago

That's wonderful. so, if I provide an external database from which step I can restart the project? (or do I need to start the project again?)

ShailNair commented 3 years ago

After combining multiple projects using combine-sqm-tables.py i get negative values for some unmapped taxa read number.

capture-20201029-154906

fpusan commented 3 years ago

Yeah, this likely because you have a development version of SQM. I think this error should already be fixed in the latest dev version. 1) First, remove all the results/tables directories from all the projects you are combining. 2) Download the latest dev version from conda (squeezemeta-dev package). 3) Run again combine-sqm-tables.py and let me know if the error is fixed.

ShailNair commented 3 years ago

ok. will try .

ShailNair commented 3 years ago

@fpusan Hi, I am exploring more into squeezemeta. How can I retrieve the amino acid /nucleotide sequence (not NCBI accession number) assigned to a particular taxonomy (while assigning taxonomy to genes). I want to manually curate a few assigned taxonomy.

Thanks,

fpusan commented 3 years ago

In order to retrieve aminoacid sequences for a given set of taxa/functions, I would use subsetTax and/or subsetFun to select the appropriate ORFs, and then look in subSQM$orfs$seqs in the resulting subsetted object. If I wanted to get the corresponding nucleotide sequences, I would retrieve the contig sequence in subSQM$contig$seqs and select the relevant positions in the contig.

ShailNair commented 3 years ago

Thanks

fpusan commented 3 years ago

I will be closing this issue now, as the original question was solved and the discussion has moved away from the original topic. Feel free to open a new one if you need further support.