monarch-initiative / oncoexporter

Cancer data to GA4GH phenopacket
https://monarch-initiative.github.io/oncoexporter
MIT License
6 stars 1 forks source link

fixing format errors #71

Closed pnrobinson closed 5 months ago

pnrobinson commented 6 months ago

fixing some syntax errors that were introduced in the merge I think some of the test classes are mixing pytest and unittest and this may be leading to weirdness I

pnrobinson commented 6 months ago

I am thinking that there may be other time related functions we will need, but yes, this has no state. Having a class makes it easier to test but once dust has settled we can refactor.

Users of the phenopacket will appreciate having P3Y11M10D format if they look at the raw file. There is absolutely no use case for having more significant digits (so to speak), since the clinical data is very noisy and it is likely that in many case the number of days was calculated as age*365. I did not think that P12321D was a valid string, although it does not seem to be forbidden. We can implement this as an option, but then we need to implement more decoding software in applications...

ielis commented 6 months ago

I am thinking that there may be other time related functions we will need, but yes, this has no state. Having a class makes it easier to test but once dust has settled we can refactor.

I do not think we need class for this, because In Python, related code can also be put into a module. Class is for state. I would prefer not waiting for dust to settle, we can probably do this type of clean up now.

Users of the phenopacket will appreciate having P3Y11M10D format if they look at the raw file.

The same users who love staring at VCF files will definitely love us for this! :smirk_cat:

There is absolutely no use case for having more significant digits (so to speak), since the clinical data is very noisy and it is likely that in many case the number of days was calculated as age*365.

I am not sure we should really think along these lines. Regardless of what was done upstream of CDA, our code should not further amplify the errors.

I did not think that P12321D was a valid string, although it does not seem to be forbidden.

ISO8601 indeed does not forbid it and I was unaware of that when I started to play with magic numbers like 365.25 for the first time.

We can implement this as an option, but then we need to implement more decoding software in applications...

The software must always be coded up to the spec. In Phenopacket Schema, we say ISO8601 duration, so I have no bad conscience regarding P1000D

pnrobinson commented 5 months ago

What I mean is that the situation is as if calculator displays 17.856239 and 3 significant digits are justified. This is like displaying P23Y2M13DT7H12M when all we know is that P23Y is approximately correct. There is no case in medical analysis where you care about whether P23Y is different from P23Y1M3D. If we display number of days, we are making things more difficult for users without any benefit...

ielis commented 5 months ago

@pnrobinson I guess it is a question of scope.

I agree that some analyses at some point may choose to do this type of rounding. However, my understanding is that here we work on ETL software, and the analysis part will live downstream/elsewhere. Therefore, it should not be our decision to round data, unless we absolutely must (we do not).

I am sensitized to this type of scope creep, so I am pointing it out.