Closed EddieLF closed 3 months ago
Attention: Patch coverage is 93.58974%
with 5 lines
in your changes are missing coverage. Please review.
Project coverage is 77.04%. Comparing base (
d4b498c
) to head (b6b334a
).
Files | Patch % | Lines |
---|---|---|
metamist/parser/generic_metadata_parser.py | 90.56% | 5 Missing :warning: |
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
@illusional thanks again for the review and feedback.
I've implemented the feedback from your review:
get_assay_meta
function and parser input arg assay_meta_columns
; in favour of existing parser function collapse_arbitrary_meta
and argument assay_meta_map
Added an inner function to collapse_arbitrary_meta
called string_to_bool
- the idea is to convert strings to Bools where possible when writing the meta dict values.
Note: This could be extended to number values too. To me, this makes sense. If the input filemap has a value of "123.45"
or "500"
, I think these should enter the meta dict as 123.45
or 500
respectively. This would require overhauling a number of other tests to account for this though. Not a huge priority but do you agree that bools / nums should be in the db as bools/nums? Even when they're in a meta dict? Does it even matter?
collapse_arbitrary_meta
creates the assay meta dict from ALL assays associated with a sequencing group. I'm not sure how important this might be, but it could be useful to limit some assay meta fields to singular assays, even if they form a sequencing group.
To demonstrate this, in the test we have two assays (each a pair of fastqs) in the filemap.
'Individual ID\tSample ID\tFilename\tAssay Meta 1\tAssay Meta 2',
'Athena\tsample_id003\t"sample_id003_01-R1.fastq.gz,sample_id003_01-R2.fastq.gz"\tTrue\tsome_value',
'Athena\tsample_id003\t"sample_id003_02-R1.fastq.gz,sample_id003_02-R2.fastq.gz"\t5\tFalse',
Both assays will be grouped into a single sequencing group upon ingestion. This means the assay meta will be collapsed. Instead of
assay_1_meta = {..., 'assay_meta_1': True, 'assay_meta_2': 'some_value', ... }
assay_2_meta = {..., 'assay_meta_1': '5', 'assay_meta_2': False, ... }
The assay meta gets collapsed by the grouping, and both assays end up sharing an identical meta:
assay_1_and_2_meta = {..., 'assay_meta_1': [True, '5'], 'assay_meta_2': [False, 'some_value'], ... }
Although the assays are grouped together to form the sequencing group, and perhaps the sequencing group should have the collapsed meta, should each individual assay maintain it's own distinct meta if that's what the filemap contains? If so, can you see an easy way to achieve this that won't break other cases?
Thanks for taking a look @violetbrina!
Do we even care much about the types for the meta field? It's all effectively just stored as a string in the SQL db anyway.
I think you're right that I'd be better off not caring about types for values stored in meta. They just look so much nicer in the GraphQL view when they're properly typed... I can live with it though.
As for the meta collapsing
A circumstance I thought up: What if we had two assays from the same sample, but from different sequencing runs? One assay is from the initial sequencing, and the second assay is from a "top-up" sequencing run a month later.
We might want to store a field "sequencing_date"
in the meta for each of the two assays, each with a different date. If we ingested these two assays simultaneously, the collapse_meta
function would concatenate the two separate "sequencing_date"
values into a list and store that in the meta of both assays. But really, we would want to keep those values separate for each assay, and avoid the collapse altogether.
I think it would make sense if the sequencing group that groups the two assays has the collapsed meta list. But each assay should have a distinct date in it's own meta.
A circumstance I thought up: What if we had two assays from the same sample, but from different sequencing runs? One assay is from the initial sequencing, and the second assay is from a "top-up" sequencing run a month later.
This would be a different Assay, so it gets its own assay meta (noting we group assays by filenames, as that has been the most stable way between all the providers we integrate with).
We might want to store a field "sequencing_date" in the meta for each of the two assays, each with a different date. If we ingested these two assays simultaneously, the collapse_meta function would concatenate the two separate "sequencing_date" values into a list and store that in the meta of both assays. But really, we would want to keep those values separate for each assay, and avoid the collapse altogether.
This shouldn't happen, because the collapse assay meta only happens when it finds multiple rows for the same grouped assay.
EG:
participant | sample | filename | assay_meta_1 |
---|---|---|---|
p1 | s1 | p1-s1-01_R1.fastq.gz | assay1_forward |
p1 | s1 | p1-s1-01_R2.fastq.gz | assay1_back |
p1 | s1 | p1-s1-02_R1.fastq.gz,p1-s1-02_R2.fastq.gz | assay2_mixed |
This gets grouped as two assays:
p1-s1-01_R*.fastq.gz
, and the collapse assay function is provided the appropriate 2 rowsp1-s1-02_R*.fastq.gz
, and the collapse assay meta function is only provided 1 row.It shouldn't happen, but if you have a forward with one experiment, and a matched reverse from another experiment, you probably got other problems to deal with...
I think it would make sense if the sequencing group that groups the two assays has the collapsed meta list. But each assay should have a distinct date in it's own meta
You can find this information on the assay, the sequencing group is just a thin grouping of sequencing assays, it should only store information about the grouping, NOT about specific experiments run within it. IE: We don't actually need to store the sequencing-type, technology, platform - we do because it's easier to query, but because we guarantee they're the same for every assay, you could just lookup one assay to get those values.
Hope this makes sense, maybe zoom might be a better back and forth medium :)
@illusional massive thanks again for coming back to repeatedly review this. I'm satisfied with the tests and your explanation makes sense!
I have gone back and updated the string_to_bool
function to also handle ints and floats. This meant a few other tests (test_parse_existing_cohort
, test_parse_ont_sheet
) also had to be updated to account for the new de-stringing behaviour in assay meta. I'm gonna go ahead and merge this and then open a release PR 😄
Some changes to the parsing module to allow more flexible meta fields to be added upon assay ingestion.
The main changes:
sample_file_map_parser.py
**kwargs
toSampleFileMapParser
instantiationgeneric_metadata_parser.py
assay_meta_columns: list[str]
(and remove unusedseq_meta_column
arg). To use this, pass a list of column header strings, and the parser will read the values in those columns and add them to the assay meta via the newget_assay_meta
function - See below example.Typing
typehints in favor of standard python typehints (Note that a lot of the diff is from this, if this is unnecessary or unwanted I'm happy to revert.)generic_parser.py
sequencing_technology == 'long-read'
is ingested, lift the assay meta fields into the sequencing group meta, like what is being done with exome and RNA assays. This will copy all meta fields except forreads
andreads_type
from theassay
to thesequencing_group
as it is created.E.g. some_sample_file_map.csv
Instantiate a
SampleFileMapParser
withignore_extra_keys
:Ingesting the filemap with this Parser will create assays and sequencing groups with meta fields: