Mismatch between clinical_significance_ordered and submitters_ordered

giladmishne commented 6 years ago

Hi,

Thanks for releasing this great resource. I noticed some discrepancies between the lengths of the semicolon-separated lists in clinical_significance_ordered and submitters_ordered:

In [1]: df = pd.read_csv('clinvar_alleles_example_750_rows.single.b37.tsv', sep='\t')

In [2]: df.shape
Out[2]: (749, 39)

In [3]: for col in 'rcv scv clinical_significance_ordered submitters_ordered'.split():
    ...:     df['len_' + col] = df[col].apply(lambda x: len(x.split(';')))

In [4]: diffs = df[df.len_clinical_significance_ordered != df.len_submitters_ordered]

In [5]: diffs.shape
Out[5]: (120, 43)

The number of entries in clinical_significance_ordered doesn't seem to match the rcv or scv lists either. Is this intended?
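
For anyone who wants to reproduce the check without the example file, here is a minimal standalone sketch of the same comparison. The helper names (`count_entries`, `find_mismatches`) and the two sample rows are made up for illustration and are not part of the clinvar repo. Unlike the session above, it counts an empty or missing field as 0 entries rather than 1, which is an assumption on my part.

```python
import csv
import io

# Columns compared in this issue.
COLUMNS = ("rcv", "scv", "clinical_significance_ordered", "submitters_ordered")


def count_entries(value):
    """Count semicolon-separated entries; empty or non-string values count as 0."""
    if not isinstance(value, str) or value == "":
        return 0
    return len(value.split(";"))


def find_mismatches(rows, columns=COLUMNS):
    """Return (row_index, counts) for every row whose columns disagree in length."""
    mismatches = []
    for i, row in enumerate(rows):
        counts = {col: count_entries(row.get(col)) for col in columns}
        if len(set(counts.values())) > 1:
            mismatches.append((i, counts))
    return mismatches


# Hypothetical sample data, NOT taken from the real ClinVar table.
SAMPLE_TSV = (
    "rcv\tscv\tclinical_significance_ordered\tsubmitters_ordered\n"
    "RCV1;RCV2\tSCV1;SCV2\tPathogenic;Benign\tLabA;LabB\n"
    "RCV3\tSCV3;SCV4\tPathogenic\tLabA;LabB\n"
)

sample_rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
sample_mismatches = find_mismatches(sample_rows)

for index, counts in sample_mismatches:
    print(index, counts)
```

The same function should work on a pandas DataFrame by passing `df.to_dict('records')` as `rows`.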

Thanks

kristjaneerik commented 6 years ago

I believe my PR https://github.com/macarthur-lab/clinvar/pull/51 fixes this, but it is still under review.

giladmishne commented 6 years ago

Thanks @kristjaneerik !

macarthur-lab / clinvar, issue #54