matsengrp / cft

Clonal family tree

How to handle invalid / broken / null data #255

Closed eharkins closed 5 years ago

eharkins commented 5 years ago

@metasoarous and I talked about this on Friday: it seems likely (we recently experienced this) that when building data, a cluster / family will sometimes end up with null or undefined values where there should be data, because of an error somewhere in the pipeline. There are two philosophies for handling this that we can think of:

A: Filter these data out of our final collection at the pipeline / data build level. This is perhaps easier because it means olmsted never has to check for invalid data at every level of the nested structure of a clonal family record.

B: However, option A does not account for the fact that one might want to know there was an error in building one's data, so that it can be corrected and the data rebuilt properly. We could instead make olmsted tolerate null values and show a warning that those data failed to build and cannot be viewed. This is more transparent, but requires more checks.
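To make the tradeoff concrete, here is a rough sketch (not cft's actual build code) of what the two options might look like at the data-build step; the field names (`clone_id`, `naive_seq`, `unique_seqs`) and the `build_error` flag are placeholders for illustration, not our real schema:

```python
import logging

log = logging.getLogger("build-data")

# Hypothetical set of fields a clonal family record needs in order to be viewable.
REQUIRED_FIELDS = ["clone_id", "naive_seq", "unique_seqs"]


def is_valid(cluster):
    """True if every required field is present and non-null."""
    return all(cluster.get(field) is not None for field in REQUIRED_FIELDS)


def build_option_a(clusters):
    """Option A: silently drop broken records at build time."""
    return [c for c in clusters if is_valid(c)]


def build_option_b(clusters):
    """Option B: keep broken records, but mark them so the front end can
    show a warning instead of crashing on missing values."""
    out = []
    for c in clusters:
        if not is_valid(c):
            c = dict(c,
                     build_error=True,
                     missing_fields=[f for f in REQUIRED_FIELDS if c.get(f) is None])
            log.warning("cluster %s did not build cleanly", c.get("clone_id"))
        out.append(c)
    return out
```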

Some other thoughts I have about this:

  1. For now, people using olmsted will either be building the data themselves or collaborating with those who are, so if there is an error in this process it should be easy to surface it at build time, rather than waiting for olmsted users to stumble on the broken data while exploring.
  2. It might not always be easy to tell whether a "broken" or errant datum is actually important or worth re-building. E.g. if a cluster / family is empty except for an id or some other minimal set of fields, there is no way for someone to know what they are missing by not having that datum. So maybe we should default to flagging in the pipeline that something went awry with that datum (this already happens to some degree: data are labelled as having experienced an error, but nothing is done about it programmatically AFAIK) so that we can correct for it as we see fit; see the sketch after this list.
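As a rough illustration of point 2, here is a hypothetical sketch of acting on that error label at build time instead of ignoring it; the `error` and `clone_id` field names are stand-ins for whatever the pipeline actually records:

```python
import json
import sys


def report_build_errors(clusters):
    """Summarize records labelled as having errored, so the researchers can
    decide which ones are worth re-building."""
    broken = [c for c in clusters if c.get("error")]
    for c in broken:
        print("WARNING: cluster %s had a build error: %s"
              % (c.get("clone_id", "<unknown>"), c["error"]),
              file=sys.stderr)
    return broken


if __name__ == "__main__":
    # e.g. python report_build_errors.py clusters.json
    with open(sys.argv[1]) as f:
        report_build_errors(json.load(f))
```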

@lauradoepker maybe you have thoughts about how you'd prefer this to be dealt with?

eharkins commented 5 years ago

Meant to submit this on olmsted, but maybe it belongs here; @metasoarous, let me know if I should move it.

lauradoepker commented 5 years ago

Thanks, Eli (@eharkins). In general, I think there should be a flag/alert somewhere to let us know when something isn't building, and then we (the researchers who know which samples/families are important) can decide whether we want to pursue troubleshooting. I think digital alerts are more reliable than humans remembering to tell one another when things go awry.

With the main pipeline, though, I think it's great to move on past the errors and complete the build with most of the data instead of choking on fringe errors. So: I agree with you.

metasoarous commented 5 years ago

@lauradoepker Thanks for the input here. I more or less concur that this is the right way to go.

@eharkins With that in mind this is definitely an Olmsted issue, since that's where all the work is gonna be if we're just passing through the tombstones for these data points. Maybe there is a separate issue here about whether or not additional tombstone/error data could be passed through or some such, but for now let's close this.

eharkins commented 5 years ago

Ok, I can open the olmsted issue. First, though, I don't follow which approach you think is best, @metasoarous. When you say:

if we're just passing through the tombstones for these data points

Do you mean allowing these data to build and be requested / read by olmsted?

@lauradoepker Where would the alert that notifies you of a failure to build a data point ideally live: in our script that builds the data (post-processing of the main pipeline), or in the olmsted web interface?

metasoarous commented 5 years ago

Do you mean allowing these data to build and be requested / read by olmsted?

Yes, that's what I mean. Perhaps it wasn't as clearly implied by Laura's comment as I thought, but if she wants to see the tombstones somewhere, that somewhere should be Olmsted. Laura isn't going to be running the script that builds the data, so that's not going to do her any good.
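For concreteness, a tombstone passed through to Olmsted might look something like the record below; the field names here are made up for illustration and would need to match whatever schema olmsted actually validates against:

```python
import json

# Hypothetical tombstone record included in the build output instead of being
# filtered out; olmsted could check build_error before rendering and show a
# warning in place of the usual clonal family view.
tombstone = {
    "clone_id": "clone-42",       # minimal identifying info that did build
    "dataset_id": "example-run",
    "build_error": True,          # flag for the front end to check
    "error_message": "tree inference failed; no phylogeny available",
}

print(json.dumps(tombstone, indent=2))
```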