mendessoares / MMinte


Non-unique header names in BLAST and effects on PctSIM #4

Open ctseto opened 8 years ago

ctseto commented 8 years ago

Summary: Non-unique sequence names in the database headers may cause mismatches between PctSim and the e-value/bitscore. This looks like a blastn output issue more than an MMinte issue.

Background: Building my own blast databases, I am encountering issues with match assignments.

For example, with the current 16Sdb a given OTU matches with 100 percent identity, 0 mismatches, and an e-value of 2e-120; but when the database is augmented with new sequences, the best match switches to a genomeid with 27 percent identity and 170 mismatches. The effect is also observed when a database is constructed from the new sequences alone, so the 27% identity anomaly is definitely associated with the new sequences.

Below: BLAST results before pre-processing names to make them unique:

denovo830 59620 27.35 234 170 0 1 234 523 756 5e-124 433
denovo830 59620 27.35 234 170 0 1 234 523 756 5e-124 433
denovo830 59620 27.35 234 170 0 1 234 523 756 5e-124 433
denovo830 59620 27.35 234 170 0 1 234 523 756 5e-124 433
denovo830 59620 21.84 206 147 12 1 199 540 738 3e-46 174

(Inspection of the alignment in the BLAST output shows a very poor alignment, consistent with the 27% identity and 170 mismatches rather than with the 5e-124 e-value and 433 bitscore.)
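A quick way to spot rows like these in a larger results file is to compare percent identity against bit score per aligned base. The sketch below assumes the rows are in the default tabular column order (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score). The helper names and both thresholds are hypothetical and chosen only so the rows from this issue illustrate the idea; they are not part of BLAST or MMinte.

```python
from collections import namedtuple

# Assumed column order: default BLAST tabular output (12 columns).
Hit = namedtuple(
    "Hit",
    "qseqid sseqid pident length mismatch gapopen "
    "qstart qend sstart send evalue bitscore",
)

def parse_hit(line):
    """Parse one whitespace-separated tabular BLAST row into a Hit."""
    f = line.split()
    return Hit(f[0], f[1], float(f[2]), int(f[3]), int(f[4]), int(f[5]),
               int(f[6]), int(f[7]), int(f[8]), int(f[9]),
               float(f[10]), float(f[11]))

def looks_inconsistent(hit, max_pident=50.0, min_bits_per_base=1.5):
    """Flag hits with low identity but a bit score typical of a near-perfect match.

    Both thresholds are illustrative assumptions, not BLAST constants.
    """
    return hit.pident < max_pident and hit.bitscore / hit.length >= min_bits_per_base

rows = [
    "denovo830 59620 27.35 234 170 0 1 234 523 756 5e-124 433",
    "denovo830 59620 21.84 206 147 12 1 199 540 738 3e-46 174",
    "denovo830 59620.37 100.00 234 0 0 1 234 523 756 5e-124 433",
]
flags = [looks_inconsistent(parse_hit(r)) for r in rows]
print(flags)  # [True, False, False]
```

Only the first row is flagged: its bit score of 433 over 234 aligned bases is what the 100% identity hit also scores, yet it reports 27.35% identity.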

Analysis of the sequence headers in the new set suggests that 38 of them share a taxa_id and therefore also share a header, which appears to interfere with parsing:

59620 ...
59620 ...
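Shared headers like these can be detected before building the database. The following is a minimal sketch, assuming the sequence identifier is the first whitespace-delimited token after `>` on each header line; the function name and the example records are made up for illustration.

```python
from collections import Counter

def duplicate_fasta_ids(lines):
    """Return {sequence_id: count} for IDs that appear on more than one header."""
    ids = Counter(
        line[1:].split()[0]  # ID = first token after '>'
        for line in lines
        if line.startswith(">") and line[1:].strip()
    )
    return {seq_id: n for seq_id, n in ids.items() if n > 1}

# Hypothetical records: two genomes sharing the taxa_id 59620.
example = [
    ">59620 first genome", "ACGT",
    ">59620 second genome", "ACGA",
    ">1000 other taxon", "TTTT",
]
print(duplicate_fasta_ids(example))  # {'59620': 2}
```

For a real file, the same function can be called on an open file handle, for example `duplicate_fasta_ids(open("new_sequences.fasta"))`, where the file name is a placeholder.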

After making the names unique:

denovo830 59620.37 100.00 234 0 0 1 234 523 756 5e-124 433
denovo830 59620.14 100.00 234 0 0 1 234 523 756 5e-124 433
denovo830 59620.8 100.00 234 0 0 1 234 523 756 5e-124 433
denovo830 59620.7 100.00 234 0 0 1 234 523 756 5e-124 433
denovo830 59620.23 83.01 206 21 12 1 199 540 738 3e-46 174

At this point, the >59620 sequence that was a mere 27% identity is no longer highly ranked.
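One way to produce unique names of the `59620.N` form seen above is to append a running counter to every identifier that occurs more than once. This is only a sketch of that idea: the exact numbering scheme used for the results in this issue is not stated, and the function name and example records are hypothetical.

```python
from collections import Counter

def make_ids_unique(lines):
    """Append '.1', '.2', ... to every FASTA ID that occurs more than once.

    IDs that are already unique are left unchanged. A generated name such as
    '59620.1' could still collide with an existing header of that exact name;
    this sketch does not check for that.
    """
    lines = [line.rstrip("\n") for line in lines]
    totals = Counter(
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    )
    seen = Counter()
    out = []
    for line in lines:
        if line.startswith(">") and line[1:].strip():
            parts = line[1:].split(None, 1)  # [id] or [id, description]
            seq_id = parts[0]
            desc = parts[1] if len(parts) > 1 else ""
            if totals[seq_id] > 1:
                seen[seq_id] += 1
                new_id = f"{seq_id}.{seen[seq_id]}"
            else:
                new_id = seq_id
            out.append(">" + new_id + (" " + desc if desc else ""))
        else:
            out.append(line)
    return out

records = [">59620 first", "ACGT", ">59620 second", "ACGA", ">1000", "TTTT"]
print(make_ids_unique(records))
# ['>59620.1 first', 'ACGT', '>59620.2 second', 'ACGA', '>1000', 'TTTT']
```

The database would then need to be rebuilt from the renamed file so that each hit maps back to exactly one sequence.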

If this issue is driven by how e-value, bitscore, and the other fields are assigned when headers are duplicated, then when a given OTU representative has a strong match to one sequence in a set of same-named sequences, the reported values might not all come from that matching sequence. In the above case, 59620 does have a match with a very low e-value and a high bitscore, but the reported identity is 27 percent, and that value is passed into MMinte as PctSim. This may affect components of MMinte that rely on the PctSim value.

Edit: This would probably have the most effect in cases where a given taxon has multiple associated genomeIDs, and those genomeIDs have 16S sequences sufficiently distinct that a query sequence would produce a different percent identity for each of the similar sequences.