qunfengdong / BLCA


July 8 2019 release - does it REQUIRE Blast 2.9.0? #20

Open wolfgangrumpf opened 5 years ago

wolfgangrumpf commented 5 years ago

I'm considering upgrading BLCA but our cluster doesn't have BLAST 2.9 on it yet. Is 2.9 required, or will the July 8 2019 release work with BLAST 2.8?

yingeddi2008 commented 5 years ago

You are right. The July 8, 2019 release should work with previous versions of BLAST. I made some minor changes so that it also works with the latest version, blastn 2.9. Let me know if you find any problems.
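For anyone who wants to confirm which BLAST+ version a cluster node actually provides before upgrading, here is a minimal sketch. It assumes `blastn` is on the `PATH` and that `blastn -version` prints a line such as `blastn: 2.9.0+`. The function names are hypothetical and are not part of BLCA.

```python
import re
import subprocess
from typing import Tuple


def parse_blast_version(version_output: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from text such as 'blastn: 2.9.0+'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"no version number found in: {version_output!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def installed_blastn_version(executable: str = "blastn") -> Tuple[int, int, int]:
    """Run `<executable> -version` and parse the first version number printed."""
    result = subprocess.run(
        [executable, "-version"],
        capture_output=True,
        text=True,
        check=True,  # raise if the executable exits with an error
    )
    return parse_blast_version(result.stdout)
```

Tuples compare element by element, so a check like `installed_blastn_version() >= (2, 9, 0)` tells you whether the node has 2.9 or newer.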

wolfgangrumpf commented 5 years ago

Okay, I’m working on the install now.

Another question for you: I’m using an older release of BLCA with an older database. Some of its results don’t agree with a “simple” BLAST search against the current NCBI database. For example, one sequence I am working with is from an ATCC E. coli strain, which NCBI recognizes when I use their online search, but BLCA says it’s E. fergusonii. Is this a database mismatch issue? That is, if I update to the newer BLCA and database, do you think the sequence will be recognized as E. coli?

Cheers,

Wolfgang Rumpf, Ph.D.
Bioinformatics Analyst, The Institute for Genomic Medicine at The Abigail Wexner Research Institute, Nationwide Children’s Hospital
Professor, University of Maryland Global Campus

yingeddi2008 commented 5 years ago

Hi Wolfgang,

Please note that the default BLCA database is NCBI's 16S rRNA database, not the NT database that you are searching when you run BLASTN online. We have noticed some issues with the 16S rRNA database, for example that some of the 16S rRNA sequences are not from type strains. I believe that is why the annotation is off. Since we have no control over NCBI's 16S rRNA database, I can't say that updating the BLCA software will fix your misclassification. I do recommend that you use a manually curated database, such as Greengenes or SILVA, instead.

I hope this helps,

Eddi

dswan commented 5 years ago

There's also a plethora of sequences in the NCBI 16S database with ambiguous nucleotides. I'd actually thought of applying a filter to remove some of the more egregiously poor sequences. It's a shame, because the ITS targeted loci project at the NCBI is far better curated for quality and really focuses on type strains.

One of the things I've been meaning to dig into a little further is the provenance of these files:

ftp://ftp.ncbi.nlm.nih.gov/refseq/TargetedLoci/Bacteria/bacteria.16SrRNA.fna.gz

and

ftp://ftp.ncbi.nlm.nih.gov/refseq/TargetedLoci/Archaea/archaea.16SrRNA.fna.gz

as opposed to the pre-formatted BLAST database. Technically they should all be the same project, I imagine, but I've noticed a few formatting issues with the BLAST database, probably down to sequence redundancy.

(updated) Having checked these files they're similar enough to satisfy me that they're the same source!
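The kind of filter mentioned above could look something like the sketch below. It drops FASTA records whose fraction of ambiguous nucleotides (anything other than A, C, G, T or U) exceeds a threshold. The 1% default and all function names are assumptions chosen for illustration, not values anyone in this thread has validated.

```python
import gzip
from typing import Iterable, Iterator, Tuple

UNAMBIGUOUS = set("ACGTU")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def ambiguous_fraction(sequence: str) -> float:
    """Fraction of characters that are not A, C, G, T or U (case-insensitive)."""
    if not sequence:
        return 1.0  # treat empty records as unusable
    ambiguous = sum(1 for base in sequence.upper() if base not in UNAMBIGUOUS)
    return ambiguous / len(sequence)


def filter_fasta(in_path: str, out_path: str, max_fraction: float = 0.01) -> Tuple[int, int]:
    """Copy records at or below max_fraction to out_path; return (kept, dropped)."""
    opener = gzip.open if in_path.endswith(".gz") else open
    kept = dropped = 0
    with opener(in_path, "rt") as src, open(out_path, "w") as dst:
        for header, sequence in read_fasta(src):
            if ambiguous_fraction(sequence) <= max_fraction:
                dst.write(f">{header}\n{sequence}\n")
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

A call such as `filter_fasta("bacteria.16SrRNA.fna.gz", "bacteria.16SrRNA.filtered.fna")` would write the surviving records uncompressed. Any taxonomy file that accompanies the database would need the same sequence IDs removed so the two stay in sync.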

qunfengdong commented 5 years ago

If you can remove those poor sequences from the NCBI 16S database, I do believe that would be better. Any other ITS loci sequences should also work, as long as you can compile the corresponding taxonomic annotation.

dswan commented 5 years ago

> If you can remove those poor sequences in NCBI 16S database, I do believe that it'd be better. Any other ITS loci sequences should also work as long as you can compile the corresponding taxonomic annotation.

I did wonder how BLAST handled these ambiguities, but I assume they would be penalised.

wolfgangrumpf commented 5 years ago

I saw that there are instructions for generating the SILVA LSU database for BLCA, but not for the SSU - I don’t suppose anyone has done this already? Or will greengenes provide sufficient resolution?

Cheers,

qunfengdong commented 5 years ago

Yes, BLAST should penalize those.

qunfengdong commented 5 years ago

No, we have not tried either SILVA LSU or SSU (the LSU instructions were kindly provided by Dr. Daniel Swan), and we have not done any systematic comparison to Greengenes either. We are simply making those options available for the community to use. Sometimes we do apply multiple databases to our own projects.
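For anyone attempting an SSU build, compiling the taxonomic annotation mostly comes down to splitting the FASTA headers. The sketch below assumes SILVA-style headers of the form `>ACCESSION.START.STOP path;of;taxa;Species name`, and assumes the sequences use the RNA alphabet and need U converted to T. The tab-separated output row is a hypothetical layout, not BLCA's confirmed taxonomy format, so adapt it to whatever the LSU instructions produce.

```python
from typing import List, Tuple


def split_silva_header(header: str) -> Tuple[str, List[str]]:
    """Split a SILVA-style header into (sequence ID, list of lineage names)."""
    header = header.lstrip(">").strip()
    # The ID ends at the first space; species names may contain further spaces.
    seq_id, _, path = header.partition(" ")
    lineage = [name.strip() for name in path.split(";") if name.strip()]
    return seq_id, lineage


def taxonomy_row(header: str) -> str:
    """Build one 'ID<TAB>lineage' row (hypothetical layout, adapt as needed)."""
    seq_id, lineage = split_silva_header(header)
    return seq_id + "\t" + ";".join(lineage)


def rna_to_dna(sequence: str) -> str:
    """Upper-case the sequence and replace U with T."""
    return sequence.upper().replace("U", "T")
```

Note that lineage paths in SILVA-style headers do not always have the same number of levels, so mapping each name to a fixed rank (genus, family, and so on) needs extra care beyond this sketch.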
