Open tnguyensanger opened 6 years ago
I found the bug that caused this. I'll have to remove and re-import a bunch of tags. The bug affected ena_sample_accession_id only, other tags are fine. Will keep you updated.
Code has been fixed.
All suspect entries have been removed from, and re-imported into, FITS.
Please check your examples, and close the issue if convinced!
There are 560 FITS file entries for which FITS is missing an ENA sample accession, and 3 FITS file entries for which the AG1K sample metadata sheet (/vector-ops/meta/ag1000g/samples.csv) is missing an ENA sample accession even though FITS contains the correct one.
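The two kinds of discrepancy counted above can be told apart with a simple comparison. This is a minimal Python sketch, not the script that produced those counts. The function name `classify_discrepancies` and the dictionary inputs (sample identifier mapped to a set of ENA sample accessions) are assumptions for illustration, and the sketch works per sample rather than per file entry.

```python
def classify_discrepancies(sheet, fits):
    """Compare ENA sample accessions per sample identifier.

    sheet, fits: dict mapping sample identifier -> set of accessions
    (a hypothetical in-memory form of samples.csv and of a FITS query result).
    """
    missing_in_fits = []   # sheet has an accession, FITS has none
    missing_in_sheet = []  # FITS has an accession, sheet has none
    for sample in sorted(set(sheet) | set(fits)):
        in_sheet = sheet.get(sample, set())
        in_fits = fits.get(sample, set())
        if in_sheet and not in_fits:
            missing_in_fits.append(sample)
        elif in_fits and not in_sheet:
            missing_in_sheet.append(sample)
    return missing_in_fits, missing_in_sheet


# Toy data only; these identifiers and accessions are made up.
sheet = {"S1": {"ERS0001"}, "S2": set(), "S3": {"ERS0003"}}
fits = {"S1": set(), "S2": {"ERS0002"}, "S3": {"ERS0003"}}
print(classify_discrepancies(sheet, fits))
```

Samples where both sources have an accession but the values differ are deliberately not reported here; they would need a third bucket.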
See https://github.com/malariagen/vector-ops/commit/b038f030fa5978d7be860676803d9da0abb8c5e3
I started with https://github.com/malariagen/vector-ops/blob/master/meta/ag1000g/samples.csv and took the 4768 Oxford codes that have an ENA sample accession in that file.
If I query for these Oxford codes limited to the Oxford code tag, I get 4400 FITS samples. That is, at least 368 samples have no annotation, or a different one, for "Oxford code". We need a better system for annotating these, perhaps a sample tracking system...
If I allow for any tag with one of the Oxford codes as value, I get 4814 FITS samples (which might be OK, as multiple FITS samples can share an Oxford code, just like Sequencescape samples).
Using the latter, I get 2814 samples. Query:
SELECT
(SELECT group_concat(value) FROM sample2tag s2 WHERE s1.sample_id=s2.sample_id AND s2.tag_id=3561) AS Oxford_code,
(SELECT group_concat(value) FROM sample2tag s2 WHERE s1.sample_id=s2.sample_id AND s2.tag_id=3587) AS ENA_sample
FROM sample2tag s1 WHERE value IN ("Oxfordcode1","Oxfordcode2",...)
GROUP BY sample_id ORDER BY Oxford_code
I'll check why the ENA sample IDs are missing for some of them.
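For anyone wanting to try this kind of lookup without access to FITS, here is a small sketch using Python's built-in `sqlite3`. The `sample2tag` table layout (`sample_id`, `tag_id`, `value`) is inferred from the query above and may not match the real schema. Tag IDs 3561 and 3587 are taken from that query. The rows, the codes `CODE-A`/`CODE-B`, and tag ID 9999 are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sample2tag (sample_id INTEGER, tag_id INTEGER, value TEXT)"
)
rows = [
    (1, 3561, "CODE-A"), (1, 3587, "ERS0001"),  # fully annotated sample
    (2, 3561, "CODE-B"),                        # no ENA sample tag at all
    (3, 9999, "CODE-B"), (3, 3587, "ERS0002"),  # code stored under another tag
]
conn.executemany("INSERT INTO sample2tag VALUES (?, ?, ?)", rows)

QUERY = """
SELECT s1.sample_id,
  (SELECT group_concat(s2.value) FROM sample2tag s2
    WHERE s2.sample_id = s1.sample_id AND s2.tag_id = 3561) AS Oxford_code,
  (SELECT group_concat(s2.value) FROM sample2tag s2
    WHERE s2.sample_id = s1.sample_id AND s2.tag_id = 3587) AS ENA_sample
FROM sample2tag s1
WHERE s1.value IN (?, ?)   -- match the code under any tag
GROUP BY s1.sample_id
ORDER BY s1.sample_id
"""
result = conn.execute(QUERY, ("CODE-A", "CODE-B")).fetchall()

# Samples matched by code but lacking an ENA sample accession.
missing = [sample_id for sample_id, _code, ena in result if ena is None]
print(result)
print(missing)
```

In this toy data sample 3 is matched only because the code appears under a different tag, which mirrors the gap between the tag-limited and any-tag counts.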
I started with https://github.com/malariagen/vector-ops/blob/master/meta/ag1000g/samples.csv and took the 4768 Oxford codes that have an ENA sample accession in that file.
Unfortunately, the column ox_code in https://github.com/malariagen/vector-ops/blob/master/meta/ag1000g/samples.csv is poorly named. It's really a general sample identifier column. Sometimes it's an Oxford code. Sometimes it's a Roma sample name. It's whatever is used to identify the sample on our end.
Sometimes FITS lists an additional erroneous ENA sample accession for a sample.
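One way to look for candidates is to list every accession that is attached to more than one sample. This is a minimal sketch under assumed inputs: the function name `accessions_shared_by_samples` and the (sample, accession) pair format are hypothetical, and the identifiers in the toy data are made up.

```python
from collections import defaultdict


def accessions_shared_by_samples(pairs):
    """Return {accession: sorted sample ids} for accessions on more than one sample."""
    by_accession = defaultdict(set)
    for sample, accession in pairs:
        by_accession[accession].add(sample)
    return {
        accession: sorted(samples)
        for accession, samples in by_accession.items()
        if len(samples) > 1
    }


# Toy (sample, accession) pairs; not real FITS content.
pairs = [
    ("S1", "ERS0001"),
    ("S2", "ERS0001"),  # same accession as S1
    ("S1", "ERS0002"),
    ("S3", "ERS0003"),
]
print(accessions_shared_by_samples(pairs))
```

A shared accession is not necessarily wrong, since a resequenced sample can legitimately carry the same one, so the output is a list for manual review rather than a list of errors.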
Here is an example:
Sample AC0009-C was resequenced and given Oxford code AC0009-Cx, which would explain why AC0009-Cx has the same ENA sample accession ID as AC0009-C. However, AC0009-C has two ENA sample accessions: ERS223874, which is accurate, and ERS177536, which is incorrect.
ERS177536 should belong to sample 'AD0687-C', but is attributed to multiple samples in FITS:
The files for sample accession ERS177536 are not multiplexed with samples with other ENA accessions: