populationgenomics / metamist

Sample level metadata system
MIT License

Update parser sg meta #757

Closed EddieLF closed 3 months ago

EddieLF commented 4 months ago

Some changes to the parsing module to allow more flexible meta fields to be added upon assay ingestion.

The main changes:

sample_file_map_parser.py

generic_metadata_parser.py

generic_parser.py


E.g. some_sample_file_map.csv

Individual ID,Sample ID,Filenames,Type,haplotag,somefield
IND01,Sample01,"AssayFile01.fq.gz,AssayFile02.fq.gz",wgs,True,somevalue

Instantiate a SampleFileMapParser with ignore_extra_keys:

parser = SampleFileMapParser(
    project=DATASET,
    search_locations=[search_path],
    allow_extra_files_in_search_path=True,
    default_sequencing=DefaultSequencing(
        seq_type='genome',
        technology='long-read',
        platform='pacbio',
    ),
    assay_meta_columns=['haplotag', 'somefield'],
    ignore_extra_keys=True,
)

Ingesting the filemap with this Parser will create assays and sequencing groups with meta fields:

{
  'sequencing_platform': 'pacbio',
  'sequencing_technology': 'long-read',
  'sequencing_type': 'genome',
  'haplotag': True,
  'somefield': 'somevalue'
}
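
As a rough illustration of how the extra columns could end up in that meta dictionary, here is a minimal standalone sketch. It does not use metamist itself: `build_assay_meta` is a hypothetical helper, the boolean coercion is an assumption about the intended behaviour, and columns that are not listed (such as `Type`) are simply left out.

```python
import csv
import io

# The example filemap from above, inlined so the sketch is self-contained.
FILEMAP = '''Individual ID,Sample ID,Filenames,Type,haplotag,somefield
IND01,Sample01,"AssayFile01.fq.gz,AssayFile02.fq.gz",wgs,True,somevalue
'''


def build_assay_meta(row, defaults, meta_columns):
    """Hypothetical helper: merge default sequencing fields with extra columns."""
    meta = dict(defaults)
    for column in meta_columns:
        value = row[column]
        # Illustrative coercion only: 'True'/'False' become booleans.
        if value.lower() in ('true', 'false'):
            value = value.lower() == 'true'
        meta[column] = value
    return meta


defaults = {
    'sequencing_platform': 'pacbio',
    'sequencing_technology': 'long-read',
    'sequencing_type': 'genome',
}
rows = list(csv.DictReader(io.StringIO(FILEMAP)))
meta = build_assay_meta(rows[0], defaults, ['haplotag', 'somefield'])
print(meta)
```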
codecov-commenter commented 4 months ago

Codecov Report

Attention: Patch coverage is 93.58974%, with 5 lines in your changes missing coverage. Please review.

Project coverage is 77.04%. Comparing base (d4b498c) to head (b6b334a).

Files Patch % Lines
metamist/parser/generic_metadata_parser.py 90.56% 5 Missing :warning:
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##              dev     #757      +/-   ##
==========================================
+ Coverage   76.98%   77.04%   +0.05%
==========================================
  Files         157      157
  Lines       13026    13060      +34
==========================================
+ Hits        10028    10062      +34
  Misses       2998     2998
```


EddieLF commented 3 months ago

@illusional thanks again for the review and feedback.

I've implemented the feedback from your review:


Things I'd like more feedback on

To demonstrate this, in the test we have two assays (each a pair of fastqs) in the filemap.

'Individual ID\tSample ID\tFilename\tAssay Meta 1\tAssay Meta 2',
'Athena\tsample_id003\t"sample_id003_01-R1.fastq.gz,sample_id003_01-R2.fastq.gz"\tTrue\tsome_value',
'Athena\tsample_id003\t"sample_id003_02-R1.fastq.gz,sample_id003_02-R2.fastq.gz"\t5\tFalse',

Both assays will be grouped into a single sequencing group upon ingestion, which means the assay meta will be collapsed. Instead of each assay keeping its own meta:

assay_1_meta = {..., 'assay_meta_1': True, 'assay_meta_2': 'some_value', ... }
assay_2_meta = {..., 'assay_meta_1': '5', 'assay_meta_2': False, ... }

the grouping collapses the assay meta, and both assays end up sharing identical meta:

assay_1_and_2_meta = {..., 'assay_meta_1': [True, '5'], 'assay_meta_2': [False, 'some_value'], ... }

Although the assays are grouped together to form the sequencing group, and perhaps the sequencing group should have the collapsed meta, should each individual assay maintain its own distinct meta if that's what the filemap contains? If so, can you see an easy way to achieve this that won't break other cases?
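
To make the collapsing described above concrete, here is a minimal sketch of one way rows could be merged into a single meta dictionary. `collapse_rows_sketch` is a hypothetical stand-in, not the parser's actual collapse function; the rule it uses (keep a lone value as a scalar, otherwise collect distinct values in order of first appearance) is an assumption, so the list ordering may differ from what the parser produces.

```python
def collapse_rows_sketch(rows):
    """Hypothetical collapse: merge several meta dicts into one."""
    collected = {}
    for row in rows:
        for key, value in row.items():
            values = collected.setdefault(key, [])
            # Keep each distinct value once, in order of first appearance.
            if value not in values:
                values.append(value)
    # A key with a single distinct value stays a scalar; otherwise it becomes a list.
    return {
        key: values[0] if len(values) == 1 else values
        for key, values in collected.items()
    }


assay_1_meta = {'assay_meta_1': True, 'assay_meta_2': 'some_value'}
assay_2_meta = {'assay_meta_1': '5', 'assay_meta_2': False}

collapsed = collapse_rows_sketch([assay_1_meta, assay_2_meta])
print(collapsed)
```

Passing a single row returns that row's values unchanged, which is the behaviour the question above is asking for on a per-assay basis.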

EddieLF commented 3 months ago

Thanks for taking a look @violetbrina!

> Do we even care much about the types for the meta field? It's all effectively just stored as a string in the SQL db anyway.

I think you're right that I'd be better off not caring about types for values stored in meta. They just look so much nicer in the GraphQL view when they're properly typed... I can live with it though.

As for the meta collapsing:

A circumstance I thought up: What if we had two assays from the same sample, but from different sequencing runs? One assay is from the initial sequencing, and the second assay is from a "top-up" sequencing run a month later.

We might want to store a field "sequencing_date" in the meta for each of the two assays, each with a different date. If we ingested these two assays simultaneously, the collapse_meta function would concatenate the two separate "sequencing_date" values into a list and store that in the meta of both assays. But really, we would want to keep those values separate for each assay, and avoid the collapse altogether.

I think it would make sense if the sequencing group that groups the two assays has the collapsed meta list. But each assay should have a distinct date in its own meta.

illusional commented 3 months ago

> A circumstance I thought up: What if we had two assays from the same sample, but from different sequencing runs? One assay is from the initial sequencing, and the second assay is from a "top-up" sequencing run a month later.

This would be a different Assay, so it gets its own assay meta (noting we group assays by filenames, as that has been the most stable way between all the providers we integrate with).

> We might want to store a field "sequencing_date" in the meta for each of the two assays, each with a different date. If we ingested these two assays simultaneously, the collapse_meta function would concatenate the two separate "sequencing_date" values into a list and store that in the meta of both assays. But really, we would want to keep those values separate for each assay, and avoid the collapse altogether.

This shouldn't happen, because assay meta is only collapsed when the parser finds multiple rows for the same grouped assay.

E.g.:

| participant | sample | filename | assay_meta_1 |
|---|---|---|---|
| p1 | s1 | p1-s1-01_R1.fastq.gz | assay1_forward |
| p1 | s1 | p1-s1-01_R2.fastq.gz | assay1_back |
| p1 | s1 | p1-s1-02_R1.fastq.gz,p1-s1-02_R2.fastq.gz | assay2_mixed |

This gets grouped as two assays:

  1. p1-s1-01_R*.fastq.gz, and the collapse assay meta function is given the matching 2 rows
  2. p1-s1-02_R*.fastq.gz, and the collapse assay meta function is given only 1 row.

It shouldn't happen, but if you have a forward read from one experiment and a matched reverse from another, you've probably got other problems to deal with...
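
The grouping in the example above can be sketched roughly as follows. This is an illustration only, not the parser's actual grouping code: `assay_key` and `group_rows_by_assay` are hypothetical names, and the `_R1`/`_R2` suffix pattern is an assumption based on the filenames in the example.

```python
import re
from collections import defaultdict

# Rows from the example above, one dict per filemap row.
ROWS = [
    {'filename': 'p1-s1-01_R1.fastq.gz', 'assay_meta_1': 'assay1_forward'},
    {'filename': 'p1-s1-01_R2.fastq.gz', 'assay_meta_1': 'assay1_back'},
    {
        'filename': 'p1-s1-02_R1.fastq.gz,p1-s1-02_R2.fastq.gz',
        'assay_meta_1': 'assay2_mixed',
    },
]

# Assumed read-pair naming: an _R1 or _R2 marker directly before .fastq.gz.
READ_SUFFIX = re.compile(r'_R[12](?=\.fastq\.gz$)')


def assay_key(filename):
    """Hypothetical key: the filename with its read marker wildcarded."""
    return READ_SUFFIX.sub('_R*', filename)


def group_rows_by_assay(rows):
    """Group rows so that both reads of a pair land in the same assay."""
    groups = defaultdict(list)
    for row in rows:
        # A cell may list several files; each row joins every key it touches once.
        keys = {assay_key(name) for name in row['filename'].split(',')}
        for key in sorted(keys):
            groups[key].append(row)
    return dict(groups)


groups = group_rows_by_assay(ROWS)
for key, members in groups.items():
    print(key, len(members))
```

Only the first group holds more than one row, so only that group would need its assay meta collapsed.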


> I think it would make sense if the sequencing group that groups the two assays has the collapsed meta list. But each assay should have a distinct date in its own meta.

You can find this information on the assay. The sequencing group is just a thin grouping of sequencing assays; it should only store information about the grouping, NOT about specific experiments run within it. I.e., we don't actually need to store the sequencing type, technology and platform. We do because it's easier to query, but because we guarantee they're the same for every assay, you could just look up one assay to get those values.
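
A small sketch of that idea, assuming assay meta is a plain dictionary: the group-level fields are read from one assay after checking the guarantee holds, while per-assay fields such as "sequencing_date" stay on the assays. `shared_sequencing_fields` is a hypothetical helper and the dates are made up for illustration.

```python
SHARED_KEYS = ('sequencing_type', 'sequencing_technology', 'sequencing_platform')


def shared_sequencing_fields(assay_metas):
    """Hypothetical helper: return the fields every assay in a group must share."""
    if not assay_metas:
        raise ValueError('a sequencing group needs at least one assay')
    first = {key: assay_metas[0][key] for key in SHARED_KEYS}
    for meta in assay_metas[1:]:
        for key in SHARED_KEYS:
            if meta[key] != first[key]:
                raise ValueError(f'assays disagree on {key!r}')
    return first


# Two assays from the same sample; the dates are illustrative values only.
assays = [
    {
        'sequencing_type': 'genome',
        'sequencing_technology': 'long-read',
        'sequencing_platform': 'pacbio',
        'sequencing_date': '2024-01-01',
    },
    {
        'sequencing_type': 'genome',
        'sequencing_technology': 'long-read',
        'sequencing_platform': 'pacbio',
        'sequencing_date': '2024-02-01',
    },
]

group_fields = shared_sequencing_fields(assays)
print(group_fields)
```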

Hope this makes sense, maybe Zoom would be a better medium for back-and-forth :)

EddieLF commented 3 months ago

@illusional massive thanks again for coming back to repeatedly review this. I'm satisfied with the tests and your explanation makes sense!

I have gone back and updated the string_to_bool function to also handle ints and floats. This meant a few other tests (test_parse_existing_cohort, test_parse_ont_sheet) also had to be updated to account for the new de-stringing behaviour in assay meta. I'm gonna go ahead and merge this and then open a release PR 😄