ohnosequences / mg7

Configurable and scalable 16S metagenomics data analysis
https://goo.gl/y3rZFD
GNU Affero General Public License v3.0

Add PacBio 16S test data run #102

Closed eparejatobes closed 8 years ago

eparejatobes commented 8 years ago

Check https://github.com/era7bio/mg7-test/blob/master/docs/pacbio-mock-tests.md

rtobes commented 8 years ago

http://downloads.hmpdacc.org/data/HMMC/HMPRP_sT1-Mock.pdf

marina-manrique commented 8 years ago

@eparejatobes @rtobes this is the information I have found about the PacBio datasets we have (it's an email from Richard Hall). Unfortunately I can't find the link to the ID from the BEI Catalog.

  1. The sequence of the primers used to do the amplicons
    P1 AGRGTTYGATYMTGGCTCAG, P2 RGYTACCTTGTTACGACTT
  2. The filtering and preprocessing protocols (correction of CCSs, trimming, ...) and the parameters used
    I applied no further filtering than basic CCS parameters: 3 passes of the insert, 0.9 predicted accuracy. I can provide data further filtered for predicted accuracy; you can also achieve the same filtering using the base QV values.
  3. Why are some reads very much larger than the length of the amplicon?
    It is possible for chimeras to form between amplicons, or in a small percentage of cases an adapter is missing on one side, forming a palindromic insert sequence. A simple length filter should remove these reads without adversely affecting yield.
  4. What exactly are the mock communities that we have?
    BEI - http://downloads.hmpdacc.org/data/HMMC/HMPRP_sT1-Mock.pdf: the amplicons are generated from the even and staggered genomic samples. Sakinaw is a real environmental sample. CAMI is from https://data.cami-challenge.org/participate; I'm not sure exactly which sample it is, maybe Cheryl knows?
  5. Are the quality values phred33-encoded quality values?
    Yes, phred scores are standard Sanger format, "from 0 to 93 using ASCII 33 to 126".

I can also provide data filtered using my 16S pipeline, if you are interested in a cleaner dataset.
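
For reference, the phred33 decoding, the QV-based accuracy filter and the length filter mentioned in points 2, 3 and 5 could be sketched in Python roughly as follows. This is only an illustration, not the mg7 or PacBio implementation: the function names are hypothetical, the mean per-base accuracy computed from the QVs is an assumed stand-in for the CCS predicted accuracy, and the length and accuracy thresholds are placeholders that would need to be chosen for the actual amplicon.

```python
from typing import List

PHRED_OFFSET = 33  # Sanger / phred33: ASCII 33..126 encode scores 0..93


def decode_phred33(quality: str) -> List[int]:
    """Convert a phred33 quality string into integer phred scores."""
    scores = [ord(ch) - PHRED_OFFSET for ch in quality]
    if any(s < 0 or s > 93 for s in scores):
        raise ValueError("quality string is not phred33-encoded")
    return scores


def mean_accuracy(quality: str) -> float:
    """Mean per-base accuracy, using P(error) = 10^(-Q/10) for each base.

    Assumption: this is an approximation derived from the base QVs,
    not the predicted accuracy reported by the CCS software.
    """
    scores = decode_phred33(quality)
    if not scores:
        return 0.0
    return sum(1.0 - 10 ** (-q / 10) for q in scores) / len(scores)


def keep_read(seq: str, quality: str,
              min_len: int, max_len: int, min_accuracy: float) -> bool:
    """Keep a read only if its length is in range and its accuracy is high enough."""
    if len(seq) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return min_len <= len(seq) <= max_len and mean_accuracy(quality) >= min_accuracy


if __name__ == "__main__":
    # Placeholder thresholds, chosen only for the example.
    read_seq = "ACGT" * 375          # 1500 bases
    read_qual = "I" * len(read_seq)  # 'I' is ASCII 73, i.e. Q40
    print(keep_read(read_seq, read_qual, min_len=1200, max_len=1600, min_accuracy=0.99))
```

A read much longer than the amplicon (a chimera or a palindromic insert from a missing adapter) fails the `max_len` check regardless of its quality values.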

rtobes commented 8 years ago

The PDF comes from here:

And the IDs for the BEI DNA listed there are these:

marina-manrique commented 8 years ago

From the hmpdacc website

Mock communities are available to the community through the BEI Resource as both a cell mixture (BEI:HM-280, HM-281) and a genomic DNA extract (BEI:HM-278D, HM-279D).

eparejatobes commented 8 years ago

OK thank you, I'll take a look at all this later today.

eparejatobes commented 8 years ago

Waiting for #86

eparejatobes commented 8 years ago

The code is here, together with the mock communities. We need to review where all test input data is, and then I will fix the input data mappings etc.

eparejatobes commented 8 years ago

LGTM