Closed kdahlquist closed 9 years ago
For RefSeq, the ID WP_001201520 (to be precise, WP_001201520.1; GenMAPP Builder drops the .n version suffix at the end) appears twice in both the XML file and the database. Likely due to a uniqueness check, it appears only once in the GDB. This explains the discrepancy of 1 between the TallyEngine counts and OriginalRowCounts.
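To make the RefSeq explanation concrete, here is a minimal Python sketch of the effect described above. It is not GenMAPP Builder's actual code: the `strip_version` helper and the sample list are hypothetical, and the second accession is a made-up placeholder. It only illustrates how dropping the .n suffix and keeping unique IDs yields a count that is lower by one.

```python
import re

def strip_version(refseq_id: str) -> str:
    """Drop a trailing .n version suffix, e.g. WP_001201520.1 -> WP_001201520."""
    return re.sub(r"\.\d+$", "", refseq_id)

# Hypothetical sample: the duplicated accession from this issue plus one
# made-up placeholder accession (not taken from the real export).
source_ids = ["WP_001201520.1", "WP_001201520.1", "WP_000000001.1"]

stripped = [strip_version(i) for i in source_ids]

raw_count = len(stripped)          # what a plain tally of the source sees
unique_count = len(set(stripped))  # what survives a uniqueness check

print(raw_count, unique_count, raw_count - unique_count)
```

With the real data, the same reasoning would give 6550 in the tally and 6549 after the uniqueness check, assuming this is the only duplicated ID.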
For OrderedLocusNames, the ID VC_1738/VC_1739 appears in both the XML file and the database. GenMAPP Builder splits this into two IDs, VC_1738 and VC_1739. Because each ID is stored twice in the GDB (one record with the underscore, another without), the one extra ID produced by the split explains the additional two records in the GDB: 7664 instead of 7662.
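The OrderedLocusNames arithmetic can be sketched the same way. Again, this is an illustration under assumptions, not GenMAPP Builder's implementation: `expand_locus_names` is a hypothetical helper, the "without underscore" form is assumed to be the name with the underscore simply removed, and VC_0001 is only a placeholder for an ordinary, unsplit ID.

```python
def expand_locus_names(ids):
    """Split slash-joined IDs, then emit each name with and without the underscore."""
    records = []
    for raw in ids:
        for name in raw.split("/"):
            records.append(name)                   # e.g. VC_1738
            records.append(name.replace("_", ""))  # assumed variant, e.g. VC1738
    return records

# Hypothetical sample: the slash-joined ID from this issue plus one placeholder.
sample = ["VC_1738/VC_1739", "VC_0001"]

records = expand_locus_names(sample)
naive_doubling = 2 * len(sample)  # what "doubling the source count" predicts

print(len(records), naive_doubling, len(records) - naive_doubling)
```

Each slash-joined ID adds one extra name and therefore two extra records over the naive doubled count, which matches the difference between 7664 and 7662 if VC_1738/VC_1739 is the only such ID.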
Thus, as far as I can tell, these discrepancies are particular to the data, and not indicative of a bug in GenMAPP Builder. Please review and let me know what you think.
I agree. I've been able to check those specifically and see no problems with the adjustments made by GenMAPP Builder. Can we close this?
Yep, happy to close it :)
When vetting the Vibrio export from builds 3 and 4 from issue #3, I compared the TallyEngine results with the OriginalRowCounts table and found some discrepancies:
- UniProt matches OK
- GeneID matches OK
- RefSeq has 6550 in XML and Database, but 6549 in OriginalRowCounts
- OrderedLocusNames has 3831 in XML and Database, which would be 7662 when doubled, but OriginalRowCounts has 7664
- GO Terms XML and Database match each other in TallyEngine, but have no correct comparison group in the GDB
I will do some sleuthing to figure out the lost IDs, but I think this will have to be later since I need to get my syllabus ready for Monday.