qunfengdong / BLCA


BLCA scoring always seems to return "100" at species level score #17

Closed dswan closed 5 years ago

dswan commented 5 years ago

I don't know if this is a feature of my dataset or not. I have a dataset of ~100k sequences from various sources.

I'm getting this identification for a particular sequence:

Consensus superkingdom:Bacteria;100.0;phylum:Firmicutes;100.0;class:Bacilli;100.0;order:Bacillales;100.0;family:Bacillaceae;100.0;genus:Bacillus;100.0;species:Bacillus zhangzhouensis;100.0;

But this is actually a very short sequence that matches many database entries equally well:

Top 2 hits from blast file:

Consensus       1ef276d07f3045c7293275c749dc254a        100.000 599     0       0       1       599     20      618     0.0     1107    599     plus    1513    599
Consensus       a6cf5d187e2a898b1f278069046fef6d        100.000 599     0       0       1       599     20      618     0.0     1107    599     plus    1529    599

Bottom 2 hits from blast file:

Consensus       dd9a3ce113e3dc10a53045dcce4682fb        100.000 598     0       0       2       599     1       598     0.0     1105    598     plus    1474    599
Consensus       ba471a3dc4742f1e5440a0d9de8e9283        100.000 598     0       0       2       599     1       598     0.0     1105    598     plus    1474    599

I would have expected more 'uncertainty' for this sequence, and would not have expected "100" values across the full taxonomy (I would expect confidence only down to the Bacillus genus).

Am I misinterpreting the scoring, and is there anything I'm missing with regards to the analysis?

Are there any concerns with using a full length gene database with a shorter input query?
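One way to check how many database entries tie for this query is to count the perfect-identity hits directly from the BLAST file. The following is a minimal sketch, assuming the first twelve columns follow the standard BLAST tabular layout (query id, subject id, percent identity, alignment length, ..., e-value, bit score) and that each hit sits on a single line. The function name `tied_hits` is hypothetical and not part of BLCA.

```python
def tied_hits(blast_lines, min_identity=100.0):
    """Return (subject_id, alignment_length, bit_score) for hits at or above min_identity.

    Assumes whitespace-separated BLAST tabular output whose first 12 columns
    are the standard ones; any extra columns are ignored.
    """
    hits = []
    for line in blast_lines:
        fields = line.split()
        if len(fields) < 12:
            continue  # skip blank or truncated lines
        pident = float(fields[2])
        if pident >= min_identity:
            hits.append((fields[1], int(fields[3]), float(fields[11])))
    return hits


# The four hits quoted in this issue, one hit per line.
example = [
    "Consensus 1ef276d07f3045c7293275c749dc254a 100.000 599 0 0 1 599 20 618 0.0 1107 599 plus 1513 599",
    "Consensus a6cf5d187e2a898b1f278069046fef6d 100.000 599 0 0 1 599 20 618 0.0 1107 599 plus 1529 599",
    "Consensus dd9a3ce113e3dc10a53045dcce4682fb 100.000 598 0 0 2 599 1 598 0.0 1105 598 plus 1474 599",
    "Consensus ba471a3dc4742f1e5440a0d9de8e9283 100.000 598 0 0 2 599 1 598 0.0 1105 598 plus 1474 599",
]
for subject, length, bitscore in tied_hits(example):
    print(subject, length, bitscore)
```

If the perfect-identity hits map to more than one species, a species-level score of 100 would be surprising, which is the situation described above.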

qunfengdong commented 5 years ago

That's odd. Would you mind sending us the FASTA sequences for that particular query and its database hits (qunfengd@gmail.com)? We'd like to reproduce it. The only reason I can think of right now is that the other hits do not have alignment coverage as good as the top hits.


dswan commented 5 years ago

I have sent you some files via email - thanks!

dswan commented 5 years ago

OK, after some further digging: this issue isn't present in fb2bd12, using python2.7 and muscle. It appears in 32c8cf7 with clustalo and python3. I'm using the same input file (the standard test.fasta) and NCBI database, and only in release fb2bd12 do I get scores <100 at the species level.

dswan commented 5 years ago

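There may have been some unintended consequences of the move to python3. The python2.7 "release" that shares the core code with the python3 release still has issue #13 (https://github.com/qunfengdong/BLCA/issues/13), or something similar, on the default NCBI database when run with python2.7, but fb2bd12 (https://github.com/qunfengdong/BLCA/commit/fb2bd12afc1dbf2c67c82a9c9d5aad91783034b5) runs fine.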

yingeddi2008 commented 5 years ago

Hi Dr. Swan,

Thanks for taking the time to dig into this. I found this Python version problem too, and I am in the process of fixing it. It seems to be a Python 3 dictionary issue. I am thinking of using defaultdict from the collections module. I am still testing and implementing the fix.

Thanks again for your help, I will keep you posted when this bug is resolved.

Best,

Eddi
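For readers unfamiliar with `defaultdict`, here is a minimal sketch of the general idea, assuming (hypothetically) that per-taxon votes are tallied across resampled replicates. The thread does not show the actual bug or the fix, so the function `tally_votes` and its input format are illustrative only and are not BLCA code.

```python
from collections import defaultdict


def tally_votes(replicate_calls):
    """Convert a list of per-replicate taxon calls into percentage scores.

    defaultdict(float) starts every unseen taxon at 0.0, so no explicit
    membership check is needed and the result does not depend on the order
    in which keys were first inserted.
    """
    votes = defaultdict(float)
    for taxon in replicate_calls:
        votes[taxon] += 1.0
    total = sum(votes.values())
    return {taxon: 100.0 * count / total for taxon, count in votes.items()}


# Hypothetical replicate calls: three for one species, one for another.
calls = ["Bacillus subtilis"] * 3 + ["Bacillus zhangzhouensis"]
print(tally_votes(calls))
```

With tied hits split across species like this, no single species would receive a score of 100.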


qunfengdong commented 5 years ago

Hi Daniel - thanks so much for your efforts. We have fixed the hidden bug, and the GitHub repository is now updated.


dswan commented 5 years ago

@yingeddi2008 @qunfengdong - I can confirm that this is fixed at my end too and it now works with python 2.7 and 3.6 - thanks so much for your prompt attention!

There might be a little issue with some of the floating point maths:

Bacillus_test superkingdom:Bacteria;100.00000000000001;phylum:Firmicutes;100.00000000000001;class:Bacilli;100.00000000000001;order:Bacillales;100.00000000000001;family:Bacillaceae;100.00000000000001;genus:Bacillus;100.00000000000001;species:Bacillus subtilis;92.00000000000001;

But I can certainly work around that for now!
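One possible workaround is to round the scores when post-processing the output. This is a minimal sketch, assuming each output line is a read ID followed by whitespace and then `rank:name;score;` pairs as in the line above; the helper `round_blca_scores` is hypothetical and not part of BLCA, and it does not handle lines without a taxonomy string.

```python
def round_blca_scores(line, ndigits=2):
    """Round every confidence score in one BLCA-style output line.

    Assumes the layout '<read_id> <rank:name;score;rank:name;score;...>'.
    Taxon names may contain spaces, so only the first whitespace is split on.
    """
    read_id, taxonomy = line.rstrip("\n").split(None, 1)
    fields = [f for f in taxonomy.split(";") if f]  # drop the trailing empty field
    pairs = []
    # Fields alternate between 'rank:name' and its score.
    for rank_name, score in zip(fields[0::2], fields[1::2]):
        pairs.append(f"{rank_name};{round(float(score), ndigits)}")
    return read_id + "\t" + ";".join(pairs) + ";"


raw = ("Bacillus_test genus:Bacillus;100.00000000000001;"
       "species:Bacillus subtilis;92.00000000000001;")
print(round_blca_scores(raw))
```

Rounding only at output time leaves the underlying calculation untouched, which seems the safer choice unless the scoring code itself is being changed.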