Update parser sg meta - Githubissues

EddieLF commented 4 months ago

Some changes to the parsing module to allow more flexible meta fields to be added upon assay ingestion.

The main changes:

sample_file_map_parser.py

Add **kwargs to SampleFileMapParser instantiation

generic_metadata_parser.py

Add optional parser arg assay_meta_columns: list[str] (and remove unused seq_meta_column arg). To use this, pass a list of column header strings, and the parser will read the values in those columns and add them to the assay meta via the new get_assay_meta function - See below example.
Remove Typing typehints in favor of standard python typehints (Note that a lot of the diff is from this, if this is unnecessary or unwanted I'm happy to revert.)

generic_parser.py

When an assay with sequencing_technology == 'long-read' is ingested, lift the assay meta fields into the sequencing group meta, like what is being done with exome and RNA assays. This will copy all meta fields except for reads and reads_type from the assay to the sequencing_group as it is created.

E.g. some_sample_file_map.csv

Individual ID,Sample ID,Filenames,Type,haplotag,somefield
IND01,Sample01,"AssayFile01.fq.gz,AssayFile02.fq.gz",wgs,True,somevalue

Instantiate a SampleFileMapParser with ignore_extra_keys:

parser = SampleFileMapParser(
        project=DATASET,
        search_locations=[search_path],
        allow_extra_files_in_search_path=True,
        default_sequencing=DefaultSequencing(
            seq_type='genome',
            technology='long-read',
            platform='pacbio',
        ),
        assay_meta_columns=['haplotag', 'somefield'],
        ignore_extra_keys=True
    )

Ingesting the filemap with this Parser will create assays and sequencing groups with meta fields:

{
  'sequencing_platform': 'pacbio',
  'sequencing_technology': 'long-read',
  'sequencing_type': 'genome',
  'haplotag': True,
  'somefield': 'somevalue'
}

codecov-commenter commented 4 months ago

Codecov Report

Attention: Patch coverage is 93.58974% with 5 lines in your changes are missing coverage. Please review.

Project coverage is 77.04%. Comparing base (d4b498c) to head (b6b334a).

Files	Patch %	Lines
metamist/parser/generic_metadata_parser.py	90.56%	5 Missing :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## dev #757 +/- ## ========================================== + Coverage 76.98% 77.04% +0.05% ========================================== Files 157 157 Lines 13026 13060 +34 ========================================== + Hits 10028 10062 +34 Misses 2998 2998 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

EddieLF commented 3 months ago

@illusional thanks again for the review and feedback.

I've implemented the feedback from your review:

Remove get_assay_meta function and parser input arg assay_meta_columns; in favour of existing parser function collapse_arbitrary_meta and argument assay_meta_map
Group, reorder, and comment on the various inputs to the generic metadata parser
Add more complexity to the unit test, including multiple assays with different meta fields and values

Things I'd like more feedback on

Added an inner function to collapse_arbitrary_meta called string_to_bool - the idea is to convert strings to Bools where possible when writing the meta dict values. Note: This could be extended to number values too. To me, this makes sense. If the input filemap has a value of "123.45" or "500", I think these should enter the meta dict as 123.45 or 500 respectively. This would require overhauling a number of other tests to account for this though. Not a huge priority but do you agree that bools / nums should be in the db as bools/nums? Even when they're in a meta dict? Does it even matter?
collapse_arbitrary_meta creates the assay meta dict from ALL assays associated with a sequencing group. I'm not sure how important this might be, but it could be useful to limit some assay meta fields to singular assays, even if they form a sequencing group.

To demonstrate this, in the test we have two assays (each a pair of fastqs) in the filemap.

'Individual ID\tSample ID\tFilename\tAssay Meta 1\tAssay Meta 2',
'Athena\tsample_id003\t"sample_id003_01-R1.fastq.gz,sample_id003_01-R2.fastq.gz"\tTrue\tsome_value',
'Athena\tsample_id003\t"sample_id003_02-R1.fastq.gz,sample_id003_02-R2.fastq.gz"\t5\tFalse',

Both assays will be grouped into a single sequencing group upon ingestion. This means the assay meta will be collapsed. Instead of

assay_1_meta = {..., 'assay_meta_1': True, 'assay_meta_2': 'some_value', ... }
assay_2_meta = {..., 'assay_meta_1': '5', 'assay_meta_2': False, ... }

The assay meta gets collapsed by the grouping, and both assays end up sharing an identical meta:

assay_1_and_2_meta = {..., 'assay_meta_1': [True, '5'], 'assay_meta_2': [False, 'some_value'], ... }

Although the assays are grouped together to form the sequencing group, and perhaps the sequencing group should have the collapsed meta, should each individual assay maintain it's own distinct meta if that's what the filemap contains? If so, can you see an easy way to achieve this that won't break other cases?

EddieLF commented 3 months ago

Thanks for taking a look @violetbrina!

Do we even care much about the types for the meta field? It's all effectively just stored as a string in the SQL db anyway.

I think you're right that I'd be better off not caring about types for values stored in meta. They just look so much nicer in the GraphQL view when they're properly typed... I can live with it though.

As for the meta collapsing

A circumstance I thought up: What if we had two assays from the same sample, but from different sequencing runs? One assay is from the initial sequencing, and the second assay is from a "top-up" sequencing run a month later.

We might want to store a field "sequencing_date" in the meta for each of the two assays, each with a different date. If we ingested these two assays simultaneously, the collapse_meta function would concatenate the two separate "sequencing_date" values into a list and store that in the meta of both assays. But really, we would want to keep those values separate for each assay, and avoid the collapse altogether.

I think it would make sense if the sequencing group that groups the two assays has the collapsed meta list. But each assay should have a distinct date in it's own meta.

illusional commented 3 months ago

A circumstance I thought up: What if we had two assays from the same sample, but from different sequencing runs? One assay is from the initial sequencing, and the second assay is from a "top-up" sequencing run a month later.

This would be a different Assay, so it gets its own assay meta (noting we group assays by filenames, as that has been the most stable way between all the providers we integrate with).

We might want to store a field "sequencing_date" in the meta for each of the two assays, each with a different date. If we ingested these two assays simultaneously, the collapse_meta function would concatenate the two separate "sequencing_date" values into a list and store that in the meta of both assays. But really, we would want to keep those values separate for each assay, and avoid the collapse altogether.

This shouldn't happen, because the collapse assay meta only happens when it finds multiple rows for the same grouped assay.

EG:

participant	sample	filename	assay_meta_1
p1	s1	p1-s1-01_R1.fastq.gz	assay1_forward
p1	s1	p1-s1-01_R2.fastq.gz	assay1_back
p1	s1	p1-s1-02_R1.fastq.gz,p1-s1-02_R2.fastq.gz	assay2_mixed

This gets grouped as two assays:

p1-s1-01_R*.fastq.gz, and the collapse assay function is provided the appropriate 2 rows
p1-s1-02_R*.fastq.gz, and the collapse assay meta function is only provided 1 row.

It shouldn't happen, but if you have a forward with one experiment, and a matched reverse from another experiment, you probably got other problems to deal with...

I think it would make sense if the sequencing group that groups the two assays has the collapsed meta list. But each assay should have a distinct date in it's own meta

You can find this information on the assay, the sequencing group is just a thin grouping of sequencing assays, it should only store information about the grouping, NOT about specific experiments run within it. IE: We don't actually need to store the sequencing-type, technology, platform - we do because it's easier to query, but because we guarantee they're the same for every assay, you could just lookup one assay to get those values.

Hope this makes sense, maybe zoom might be a better back and forth medium :)

EddieLF commented 3 months ago

@illusional massive thanks again for coming back to repeatedly review this. I'm satisfied with the tests and your explanation makes sense!

I have gone back and updated the string_to_bool function to also handle ints and floats. This meant a few other tests (test_parse_existing_cohort, test_parse_ont_sheet) also had to be updated to account for the new de-stringing behaviour in assay meta. I'm gonna go ahead and merge this and then open a release PR 😄

populationgenomics / metamist

Update parser sg meta #757

The main changes:

Codecov Report

Things I'd like more feedback on