malariagen / fits

File tracking system for group DK

Wrong ENA sample accession given to some samples #22

Open tnguyensanger opened 6 years ago

tnguyensanger commented 6 years ago

Sometimes FITS lists an additional, erroneous ENA sample accession for a sample.

Here is an example:

select fits_sample_id, mlw_sample_id, oxford_sample_id, ena_sample_accession_id from vw_pivot_sample
where vw_pivot_sample.oxford_sample_id like 'AC0009-C%';
# fits_sample_id  mlw_sample_id  oxford_sample_id  ena_sample_accession_id
13498             1579623        AC0009-Cx         ERS223874
104435            1579623        AC0009-C          ERS177536|ERS223874

Sample AC0009-C was resequenced and given Oxford code AC0009-Cx, which would explain why AC0009-Cx has the same ENA sample accession as AC0009-C. However, AC0009-C has two ENA sample accessions: ERS223874, which is correct, and ERS177536, which is incorrect.

ERS177536 should belong to sample 'AD0687-C', but is attributed to multiple samples in fits:

select fits_sample_id, mlw_sample_id, oxford_sample_id, ena_sample_accession_id from vw_pivot_sample
where vw_pivot_sample.ena_sample_accession_id like '%ERS177536%';
# fits_sample_id  mlw_sample_id  oxford_sample_id  ena_sample_accession_id
3700              1465120        AD0687-C          ERS177536
18403             2590377                          ERS177536
104435            1579623        AC0009-C          ERS177536|ERS223874
104459            1582001        AV0004-C          ERS177536|ERS224852
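To find every accession affected in the same way, one option is to split the `|`-joined values and group samples by accession. The following is only a sketch in Python over the rows from the output above; the helper name `shared_accessions` is hypothetical, and it assumes multiple accessions are always joined with `|` as they are in the view.

```python
from collections import defaultdict

# Rows copied from the vw_pivot_sample output above:
# (fits_sample_id, oxford_sample_id, ena_sample_accession_id)
ROWS = [
    (3700, "AD0687-C", "ERS177536"),
    (18403, "", "ERS177536"),
    (104435, "AC0009-C", "ERS177536|ERS223874"),
    (104459, "AV0004-C", "ERS177536|ERS224852"),
]


def shared_accessions(rows):
    """Map each ENA accession to the FITS sample ids that carry it,
    keeping only accessions found on more than one sample."""
    by_accession = defaultdict(set)
    for fits_sample_id, _oxford_code, accessions in rows:
        # Assumption: several accessions on one sample are joined with "|".
        for accession in accessions.split("|"):
            if accession:
                by_accession[accession].add(fits_sample_id)
    return {
        accession: sorted(sample_ids)
        for accession, sample_ids in by_accession.items()
        if len(sample_ids) > 1
    }


if __name__ == "__main__":
    for accession, sample_ids in shared_accessions(ROWS).items():
        print(accession, sample_ids)
```

A shared accession is not necessarily wrong (a resequenced sample such as AC0009-Cx legitimately shares one with AC0009-C), so the result is a list of candidates to review rather than a list of errors.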

The files for sample accession ERS177536 are not multiplexed with samples with other ENA accessions:

SELECT file_name,ebi_sample_acc from submission as submission1
where submission1.ebi_sample_acc like '%ERS177536%';
# file_name ebi_sample_acc
8763_2#22.bam ERS177536
8812_7#22.bam ERS177536
8812_8#22.bam ERS177536

select file_name, ebi_sample_acc from submission
where submission.file_name in
('8763_2#22.bam', '8812_7#22.bam', '8812_8#22.bam');
# file_name ebi_sample_acc
8763_2#22.bam ERS177536
8812_7#22.bam ERS177536
8812_8#22.bam ERS177536

magnusmanske commented 6 years ago

I found the bug that caused this. I'll have to remove and re-import a bunch of tags. The bug affected ena_sample_accession_id only, other tags are fine. Will keep you updated.

magnusmanske commented 6 years ago

Code has been fixed.

All suspect entries have been removed from, and re-imported into, FITS.

Please check your examples, and close the issue if convinced!

tnguyensanger commented 6 years ago

There are 560 FITS file entries in which FITS is missing an ENA sample accession. There are also 3 FITS file entries where FITS contains the correct accession but the AG1K sample metadata sheet in /vector-ops/meta/ag1000g/samples.csv is incorrectly missing it.

See https://github.com/malariagen/vector-ops/commit/b038f030fa5978d7be860676803d9da0abb8c5e3
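A comparison like this can be sketched as a three-way classification of sample identifiers. The Python below is a minimal illustration and not the script used for the commit above: the function name `compare_accessions` is hypothetical, the identifiers and accessions are made-up placeholders, and it assumes each side has already been reduced to one accession (or none) per identifier.

```python
def compare_accessions(sheet, fits):
    """Compare two {sample_identifier: accession or None} mappings.

    Returns three sorted lists of identifiers:
    accession missing in FITS only, missing in the sheet only,
    and present on both sides but different.
    """
    missing_in_fits, missing_in_sheet, conflicting = [], [], []
    for ident in sorted(set(sheet) | set(fits)):
        # Treat absent keys, None and "" all as "no accession".
        sheet_acc = sheet.get(ident) or None
        fits_acc = fits.get(ident) or None
        if sheet_acc and not fits_acc:
            missing_in_fits.append(ident)
        elif fits_acc and not sheet_acc:
            missing_in_sheet.append(ident)
        elif sheet_acc != fits_acc:
            conflicting.append(ident)
    return missing_in_fits, missing_in_sheet, conflicting


# Placeholder data only; none of these values come from samples.csv or FITS.
SHEET = {"S1": "ERS0000001", "S2": None, "S3": "ERS0000003", "S4": "ERS0000004"}
FITS = {"S1": None, "S2": "ERS0000002", "S3": "ERS0000003", "S4": "ERS0000009"}

if __name__ == "__main__":
    print(compare_accessions(SHEET, FITS))
```

Keeping the three cases separate matters here because each needs a different fix: a re-import into FITS, a correction to the sheet, or a manual check of which accession is right.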

magnusmanske commented 6 years ago

I started with https://github.com/malariagen/vector-ops/blob/master/meta/ag1000g/samples.csv and took the 4768 Oxford codes that have an ENA sample accession in that file.

If I query for these Oxford codes restricted to the "Oxford code" tag, I get 4400 FITS samples. That is, at least 368 samples have no annotation, or a different one, for "Oxford code". We need a better system for annotating these, such as a sample tracking system perhaps...

If I allow for any tag with one of the Oxford codes as value, I get 4814 FITS samples (which might be OK, as multiple FITS samples can share an Oxford code, just like Sequencescape samples).

Using the latter, I get 2814 samples. Query:

SELECT
(SELECT group_concat(value) FROM sample2tag s2 WHERE s1.sample_id=s2.sample_id AND s2.tag_id=3561) AS Oxford_code,
(SELECT group_concat(value) FROM sample2tag s2 WHERE s1.sample_id=s2.sample_id AND s2.tag_id=3587) AS ENA_sample
FROM sample2tag s1 WHERE value IN ("Oxfordcode1","Oxfordcode2",...)
GROUP BY sample_id ORDER BY Oxford_code

I'll check why the ENA sample IDs are missing for some of them.
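For anyone who wants to try the query shape without access to FITS, here is a sketch against an in-memory SQLite stand-in. The three-column `sample2tag` layout is an assumption inferred from the query, the tag ids 3561 and 3587 are taken from it, and the function names are hypothetical. The first three samples use values quoted earlier in this thread; `PLACEHOLDER-1` is an invented sample with no ENA tag, included only to show a missing accession.

```python
import sqlite3

OXFORD_CODE_TAG = 3561  # tag ids as used in the query above
ENA_SAMPLE_TAG = 3587


def build_demo_db():
    """Create an in-memory stand-in for sample2tag (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sample2tag (sample_id INTEGER, tag_id INTEGER, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO sample2tag VALUES (?, ?, ?)",
        [
            (3700, OXFORD_CODE_TAG, "AD0687-C"),
            (3700, ENA_SAMPLE_TAG, "ERS177536"),
            (13498, OXFORD_CODE_TAG, "AC0009-Cx"),
            (13498, ENA_SAMPLE_TAG, "ERS223874"),
            (104435, OXFORD_CODE_TAG, "AC0009-C"),
            (104435, ENA_SAMPLE_TAG, "ERS223874"),
            (1, OXFORD_CODE_TAG, "PLACEHOLDER-1"),  # invented: no ENA tag
        ],
    )
    return conn


def oxford_to_ena(conn, codes):
    """Return (sample_id, oxford_code, ena_sample) for every sample that has
    any tag whose value is one of the given codes."""
    placeholders = ",".join("?" for _ in codes)
    sql = f"""
        SELECT s1.sample_id,
               (SELECT group_concat(s2.value) FROM sample2tag s2
                 WHERE s2.sample_id = s1.sample_id AND s2.tag_id = ?) AS oxford_code,
               (SELECT group_concat(s2.value) FROM sample2tag s2
                 WHERE s2.sample_id = s1.sample_id AND s2.tag_id = ?) AS ena_sample
          FROM sample2tag s1
         WHERE s1.value IN ({placeholders})
         GROUP BY s1.sample_id
         ORDER BY oxford_code
    """
    params = [OXFORD_CODE_TAG, ENA_SAMPLE_TAG, *codes]
    return conn.execute(sql, params).fetchall()


if __name__ == "__main__":
    rows = oxford_to_ena(
        build_demo_db(), ["AC0009-C", "AC0009-Cx", "AD0687-C", "PLACEHOLDER-1"]
    )
    for row in rows:
        print(row)
    # Samples matched by code but lacking an ENA sample tag.
    print([sample_id for sample_id, _code, ena in rows if ena is None])
```

Binding the codes as parameters avoids pasting thousands of quoted values into the SQL text, and filtering on `ena is None` gives the list of samples whose accession still needs chasing.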

tnguyensanger commented 6 years ago

> I started with https://github.com/malariagen/vector-ops/blob/master/meta/ag1000g/samples.csv and took the 4768 Oxford codes that have an ENA sample accession in that file.

Unfortunately, the column ox_code in https://github.com/malariagen/vector-ops/blob/master/meta/ag1000g/samples.csv is poorly named. It's really a general sample identifier column. Sometimes it's an Oxford code. Sometimes it's a Roma sample name. It holds whatever is used to identify the sample on our end.