merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
441 stars 146 forks source link

C/R estimation bug for bins imported via anvi-import-collection --contigs-mode #516

Closed adityabandla closed 7 years ago

adityabandla commented 7 years ago

Hi

Although, I understand CheckM and ANVIO use different sets of SCGs to estimate genome completeness and precision, I was surprised by the amount by which the estimates varied

I binned my contigs using METABAT and subsetted bins that had completion greater than 50% and precision greater than 90% (using CheckM). However, when I analyzed this collection using ANVIO, none of the bins had completion estimates above 50%

Has anyone encountered this before?

Regards, Aditya

adityabandla commented 7 years ago

An update

Just to verify what was going on, I did the following

Used CheckM to independently estimate the completeness and redundancy using the Campbell et al BSCG set and Dupont et al BSCG set

Results from both the sets agreed closely with the estimates suggested initially by CheckM

So I am confused about the C/R estimates that ANVIO is throwing up

Regards Aditya

meren commented 7 years ago

Hi Aditya,

It would be very useful if you can send some of the interesting FASTA files you see a big discrepancy so we can test them. You can attach them to your comment from the GitHub interface.

Thanks,

adityabandla commented 7 years ago

Hi Meren

Can I email them over to you instead?

Regards Aditya

meren commented 7 years ago

Sure :)

adityabandla commented 7 years ago

Just did!

meren commented 7 years ago

No email yet. Dropbox could be a better option.

On May 14, 2017 11:03 AM, "Aditya Bandla" notifications@github.com wrote:

Just did!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/merenlab/anvio/issues/516#issuecomment-301322156, or mute the thread https://github.com/notifications/unsubscribe-auth/AAMCuzjUmX_dGlOp1lswAO-sBE4plSdQks5r5yXNgaJpZM4NaWOL .

adityabandla commented 7 years ago

My bad :/

Just resent the email and shared the files on Dropbox as well

meren commented 7 years ago

I don't understand. These are the bins you sent as the ones that anvi'o estimates to be less than 50%, and this is what I get:

anvi'o also predicts bin-705 to be a CPR genome, so 70% completion with bacterial single-copy genes is as good as it gets, and the size seems appropriate.

$ anvi-script-FASTA-to-contigs-db bin-257.fa
$ anvi-script-get-collection-info -c bin-257.db

Genome in "bin-257.db"
===============================================
bin-257 :: PC: 100.00%, PR: 0.00%, N: 158, S: 3,046,041, D: bacteria (1.00)
$ anvi-script-FASTA-to-contigs-db bin-705.fa
$ anvi-script-get-collection-info -c bin-705.db

Genome in "bin-705.db"
===============================================
bin-705 :: PC: 70.50%, PR: 2.16%, N: 61, S: 549,742, D: bacteria (0.73)

Am I missing something?

adityabandla commented 7 years ago

Hi Meren

Thats weird. When I ran the CPR-classifier, it gave me C/R of 55.4/0 for bin.257 and 32.3/0.72 for bin.705

I got the same values for these bins when I ran anvi-refine with the --quick-summary flag

Will add the folders from the anvi-refine step to the Dropbox folder. Even more confusing now

meren commented 7 years ago

What version of anvi'o are you using? Just to make sure we are on the same page.

adityabandla commented 7 years ago

v2.3.1

meren commented 7 years ago

This is with the master:

bin-257 ......................................: NOT CPR (Conf: 100%, Size: 3,046,041, C/R: 100%/0%)
bin-705 ......................................: CPR (Conf: 100%, Size: 549,742, C/R: 71%/2%)
$ anvi-script-predict-CPR-genomes cpr-bscg.classifier -c ~/Dropbox/anvio-hmm/bin-257.db

WARNING
===============================================
You did not provide a collection name. Anvi'o will assume that what is in your
contigs database is a a single genome (or genome bin).

Auxiliary Data ...............................: Found: /Users/meren/Dropbox/anvio-hmm/bin-257.h5 (v. 1)
Contigs DB ...................................: Initialized: /Users/meren/Dropbox/anvio-hmm/bin-257.db (v. 8)
Classifier ...................................: Initialized with 139 features grouped into 2 classes.
Num samples to classify ......................: 1.

Genome in "bin-257.db"
===============================================
bin-257 ......................................: NOT CPR (Conf: 100%, Size: 3,046,041, C/R: 100%/0%)

The only thing changed was a small fix, so I don't see why it wouldn't give the same result in v2.3.1, I feel like there is something wrong with your anvi'o copy :/

meren commented 7 years ago

Tom suggests that your anvi-run-hmms may have ended prematurely. Can you please re-run it and make sure you are getting the same results?

adityabandla commented 7 years ago

Hi Meren

Do I need to erase the results from the previous hmm run somehow before I run anvi-run-hmms again?

meren commented 7 years ago

No, when you run anvi-run-hmms it will replace the previous ones.

adityabandla commented 7 years ago

Okay, I am re-running them. It might take a while, since I still have the issue of hmmscan using just 2 cpus, even though I ask it to use 20 :(

meren commented 7 years ago

Yes, but there is nothing anvi'o can do about that one. It is an issue for your systems administrator to look into :/

adityabandla commented 7 years ago

I guess it might not be an issue with the system itself, as I verified the same with another server, for which I have sudo access.

My guess right now, is that it might be a limitation of hmmscan depending on the size of the input file. In my case, the contigs database has close to 7 million sequences.

I could be wrong about my conjecture about hmmscan as well

meren commented 7 years ago

another server with a different operating system?

adityabandla commented 7 years ago

Both servers are Linux based, but with different architectures Linux 2.6.32-504.12.2.el6.x86_64 and Linux 2.6.32-642.11.1.el6.x86_64

meren commented 7 years ago

Those kernels look pretty similar to me ;) This is not an anvi'o issue. Something somewhere needs to be recompiled with a different flag (unless hmmscan is able to multithread properly). Nevertheless. Out of topic. Let's go back to the original issue and make sure we understand whether there is a discordant C/R estimates between CheckM and anvi'o.

adityabandla commented 7 years ago

The HMMs just got done. Its the same issue again. The C/R estimates haven't changed

meren commented 7 years ago

This is very interesting :( but without being able to reproduce the error independently, I don't know I can address it.

adityabandla commented 7 years ago

I even tried uninstalling anvio completely and then did a complete reinstall from the master repo. Still I end up with this issue

adityabandla commented 7 years ago

@meren I just checked the database versions for the contigs, samples and merged profiles using the command you had suggested

for i in latest-db/final.contigs.fixed.db; do echo $i; sqlite3 $i 'select * from self;'; echo; done

They seem to be different. Is this normal? Also, the bins were imported using the --contigs-mode. I could figure out a way to transfer these big files, if that would help :|

Update: I reproduced what you were able to do, i.e. create dummy databases from the individual bins and run hmms. The C/R estimates now seem correct. But they dont work with the contigs and profiles database that I have. Could it be possibly due to importing bins using --contigs-mode as the script to assess completeness, seemed to refer only to splits

meren commented 7 years ago

They seem to be different. Is this normal?

Yes, it is.

Could it be possibly due to importing bins using --contigs-mode as the script to assess completeness, seemed to refer only to splits

I am not sure but I would be surprised if that's the case. We need something to reproduce this :/ I will think about that.

Thank you.

adityabandla commented 7 years ago

@meren I tried to repeat the whole analysis on a Centos system, and I pretty much end up with the same bug

Thinking about the issue, could it be possibly to due some sort of soft-limits on the contig length of the bins being imported? For instance, I could see that for the profiling step, contigs of length less that 2500 bp are not considered, even though the contigs database might have contigs less that 2500bp

For my bins, I am sure that there are contigs less than 2500bp, but greater than 1000bp, generated using the ensemble binning option of METABAT

meren commented 7 years ago

Contig length can't have anything to do with this :( There is something else going on, and we haven't had a chance to look into this yet.

meren commented 7 years ago

Contig length can't have anything to do with this :( There is something else going on, and we haven't had a chance to look into this yet.

adityabandla commented 7 years ago

I just noticed another discrepancy. For bin.257, the number of contigs is 158 when this bin is analysed separately (building a contig database and then summarizing this single bin). However, when I import it into my contigs database (all of my contigs) and then summarize my collection, N is 154, while the total bin size is the same

tdelmont commented 7 years ago

Hi Aditya,

Can you confirm it is not 158 splits for 154 contigs?

Best,

Tom

On Tue, May 16, 2017 at 9:04 AM, Aditya Bandla notifications@github.com wrote:

I just noticed another discrepancy. For bin.257, the number of contigs is 158 when this bin is analysed separately (building a contig database and then summarizing this single bin). However, when I import it into my contigs database (all of my contigs) and then summarize my collection, N is 154, while the total bin size is the same

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/merenlab/anvio/issues/516#issuecomment-301792697, or mute the thread https://github.com/notifications/unsubscribe-auth/AH8mWfTpkeHLIsarQbZyQPfS00tLRQaBks5r6azygaJpZM4NaWOL .

adityabandla commented 7 years ago

Hi Tom

I assume "N" in both cases refers only to contigs? (Referring to Merens test output posted above, N is 158 for the single bin, while N is 154 when I summarize this bin along with the full contigs db)

In any case, I assume N should still be the same, considering it's the same bin in both cases?

Regards,

screen shot 2017-05-16 at 10 28 38 pm

Aditya

adityabandla commented 7 years ago

@meren @tdelmont It turned out to be an issue with the contigs database. After I regenerated it from scratch, the C/R estimates seem to be correct

Since the server and anvio did not raise any red flags, was difficult to troubleshoot :/

Thank you for all the suggestions and support

meren commented 7 years ago

Hey Aditya,

Do you still have the broken contigs db? If you can make the old and the new one available to us, we may be able to diagnose the problem and add a red flag somewhere.

Best,

adityabandla commented 7 years ago

Hi Meren

Yes, I have a copy of the broken db as well. I will figure out a way how best to share them

Best, Aditya

nvpatin commented 7 years ago

Hi all, Was this problem ever resolved in the code? I'm having the same issue, getting different completion/redundancy rates for my bins compared to CheckM values. I will re-generate the contigs database since that worked for Aditya, but I will likely do it the same way as before so not sure if anything will change. Thanks, Nastassia

meren commented 7 years ago

@nvpatin, which version of anvi'o are you using?

anvi-interactive -v
nvpatin commented 7 years ago

I generated the contigs db in 2.3.2, but just updated to 2.4.0. Was going to see if that changed things.

meren commented 7 years ago

It should :)

maglau commented 6 years ago

I understood that Anvio uses the Campbell et al BSCG set and Rinke et al ASCG set to calculate completeness and redundancy, whereas the default CheckM analysis uses a set of genomes that comes with the CheckM package. I am not surprised to see differences in the estimated C/R values obtained by the Anvio (v4) and CheckM. I am curious to know how different the C/R values should one become alarmed.