husonlab / megan-ce

MEGAN Community Edition
GNU General Public License v3.0

CSV summary import, square brackets in taxon names #2

Open mdeleeuw opened 8 years ago

mdeleeuw commented 8 years ago

I'm trying to import a mock community composition in CSV taxonomy summary format into MEGAN CE. Taxon names with square brackets, such as "[Ruminococcus] gnavus", get imported at the wrong taxonomic level. I traced the problem to src/megan/classification/util/MultiWords.java, line 72. Because of the "&& (Character.isLetter(line.charAt(i)))" condition, the code in src/megan/classification/IdParser.java starting at line 285 tests for "Ruminococcus] gnavus" among other strings, but never for "[Ruminococcus] gnavus". I also had to comment out lines 255-257 of IdParser.java to keep the counts from being accounted for at the Order level. The CSV content below reflects the 48 strains of the mock-1 community from

Bokulich, N. A., Rideout, J. R., Mercurio, W. G., Wolfe, B., F, C., Maurice, et al. (2016). mockrobiota: a public resource for microbiome bioinformatics benchmarking. PeerJ, 1–16. http://doi.org/10.7287/peerj.preprints.2065v1 https://github.com/caporaso-lab/mockrobiota

and can be used to reproduce the issue.

Bifidobacterium pseudocatenulatum,1000
Bifidobacterium bifidum,1000
Collinsella intestinalis,1000
Alistipes indistinctus,1000
Bacteroides ovatus,1000
Bacteroides uniformis,1000
Bacteroides cellulosilyticus,1000
Bacteroides thetaiotaomicron VPI-5482,1000
Bacteroides thetaiotaomicron,1000
Bacteroides thetaiotaomicron,1000
Bacteroides vulgatus,1000
Bacteroides xylanisolvens,1000
Bacteroides intestinalis,1000
Bacteroides eggerthii,1000
Bacteroides dorei,1000
Bacteroides finegoldii,1000
Parabacteroides johnsonii,1000
Anaerococcus hydrogenalis,1000
Anaerotruncus colihominis,1000
Blautia luti,1000
Blautia hansenii,1000
Tyzzerella nexilis,1000
Clostridium sp. A2-232,1000
[Clostridium] leptum,1000
[Clostridium] saccharolyticum,1000
[Clostridium] asparagiforme,1000
Hungatella hathewayi,1000
Clostridium sporogenes,1000
Coprococcus comes,1000
Dorea formicigenerans,1000
Dorea longicatena,1000
[Eubacterium] eligens,1000
Eubacterium ventriosum,1000
Holdemanella biformis,1000
Faecalibacterium prausnitzii M21/2,1000
Roseburia intestinalis,1000
[Ruminococcus] gnavus,1000
Ruminococcus lactaris,1000
[Ruminococcus] torques,1000
Streptococcus infantarius,1000
Subdoligranulum variabile,1000
Edwardsiella tarda,1000
Enterobacter cancerogenus,1000
Escherichia coli K12,1000
Escherichia fergusonii,1000
Proteus penneri,1000
Providencia alcalifaciens,1000
Akkermansia muciniphila,1000
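
To make the effect of the letter check easier to see, here is a minimal standalone sketch. It is not the actual MultiWords.java code: the class WordStartSketch, the methods isWordStart and candidates, and the boundary rules they use are all hypothetical and only assume that candidate names are substrings running from a word start to a word end. With the Character.isLetter condition enabled, no candidate can begin with "[", so the bracketed name is never tried.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT MEGAN's MultiWords implementation.
class WordStartSketch {

    // A position counts as a word start if it is not whitespace and follows a
    // non-alphanumeric character (or is the first character of the line).
    // With lettersOnly = true it must additionally be a letter, which mimics
    // the "&& Character.isLetter(line.charAt(i))" condition from the report.
    static boolean isWordStart(String line, int i, boolean lettersOnly) {
        char c = line.charAt(i);
        if (Character.isWhitespace(c)) {
            return false;
        }
        boolean afterBoundary = (i == 0) || !Character.isLetterOrDigit(line.charAt(i - 1));
        return afterBoundary && (!lettersOnly || Character.isLetter(c));
    }

    // Collect every substring that begins at a word start and ends either at
    // the end of the line or just before a whitespace character.
    static List<String> candidates(String line, boolean lettersOnly) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < line.length(); i++) {
            if (!isWordStart(line, i, lettersOnly)) {
                continue;
            }
            for (int j = i + 1; j <= line.length(); j++) {
                if (j == line.length() || Character.isWhitespace(line.charAt(j))) {
                    result.add(line.substring(i, j));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String name = "[Ruminococcus] gnavus";
        System.out.println("letters only: " + candidates(name, true));
        System.out.println("any start:    " + candidates(name, false));
    }
}
```

In this sketch, the letters-only variant yields "Ruminococcus] gnavus" but not "[Ruminococcus] gnavus", which matches the behavior I observed. Dropping the letter requirement makes the full bracketed name one of the candidates. Whether relaxing the check is safe for other input formats is something I have not verified.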