hammerlab / cohorts

Utilities for analyzing mutations and neoepitopes in patient cohorts
Apache License 2.0
20 stars 4 forks source link

Report on but do not error when patient has no RNA data #245

Open jburos opened 7 years ago

jburos commented 7 years ago

Here I'm trying to achieve a goal of allowing an analysis on a cohort where some but not all records have RNA sequencing data. Currently, if one is using a summary function like cohorts.functions.expressed_missense_snv_count, you will see an exception if some but not all records have RNA expression data.

Instead, we want to print a warning for these patients and populate the resulting df with np.NaN for any patient that does not contain RNA sequencing data where the filter function depends on it. This behavior would then be consistent with (for example) that of other summary functions like missense_snv_count for patients that do not have VCF files.

Aside: The correct way to handle this in most situations is to require that the cohort be filtered to those Patients having RNA data prior to analysis. This would make the exception desirable. However there are scenarios (like when using as_dataframe) where one might desire a different behavior. This however, runs the risk of a user ignoring the INFO messages & proceeding with analysis in error. This might be something we want to allow the user to configure, ie whether to allow NaN values in the cohort or not.

At any rate, for now, this PR does a few things:

  1. Create new error classes for missing BAM files of various types
  2. Catch those errors if they are used to filter variants or effects (not sure if this should apply to polyphen?)
  3. Report those errors as info using the logger.

Also, for the sake of consistency, this PR:

  1. Modifies earlier print statements when VCFs are not found for a patient to also use these error classes & report as INFO using the logger (previously these were print statements).

Note that I refactored some parts of the Cohort._load_single_patient_variants and Cohort._load_single_patient_effects in order to throw & report on these scenarios.

coveralls commented 7 years ago

Coverage Status

Coverage decreased (-0.2%) to 52.347% when pulling 7114c6ef88a3c49ed5cef66f1a83c6504fe0e2c2 on fix-expression-errors into ea31881381e592cd26f6125a4723ccd74796ec0e on master.

jburos commented 7 years ago

Just noticed that these are all reported using logger.info() ; I wonder if they should instead be warnings? Edited the text above to reflect this.

coveralls commented 7 years ago

Coverage Status

Coverage decreased (-0.2%) to 52.347% when pulling cd9e2ffc98f2cfaab95e43badae7d94461bfa397 on fix-expression-errors into ea31881381e592cd26f6125a4723ccd74796ec0e on master.

tavinathanson commented 7 years ago

@jburos I generally agree with your aside. In lung, I just have load_cohort(only_rna=True) and proceed from there. I'm not sure what the downsides are of that?

jburos commented 7 years ago

@tavinathanson Thx. First this is a question of consistency, since this PR extends the logic currently in place for many functions to another class of functions.

But to elaborate on my aside, above, there are two primary/immediate reasons why this is undesirable for my use case - mainly when using as_dataframe to export data for a cohort.

  1. In some cases, I want to compare attributes (e.g. survival) for patients with & without missing data as part of an exploratory analysis. Without the ability to export data for the full cohort, including NaN values for missing data, I am unable to do this type of analysis.
  2. Many of my analysis make use of data where values are missing, either to impute data for these values or include the records to (e.g.) stabilize survival estimates.

In general, when for example working in R, I often want to export my data once & then move forward with analysis using a data frame in which I can see the full cohort's data. Re-exporting data for each analysis is not only inefficient, but also runs the risk of inconsistencies in computed values across exported instances.

Separately, I can imagine a use case where you might want to initialize a cohort once (which can take a non-trivial amt of time, even with caching) & then loop through various analyses irrespective of the fact that some data are missing for some patients in some analyses. This is where an unsuspecting user might ignore or forget that the sample sizes are different for each analysis. Maybe we want to go through each of our analyses & report on the sample size for each?

This is not always a problem by the way - for example, in our bladder-cohort analyses we didn't restrict all analyses to the same subset of patients as has all data available.

At any rate in general I propose we move this discussion to an issue (or set of issues), since they are somewhat beyond the scope of this PR. For now, I see two open questions:

  1. At what level to report these messages? As INFO or WARNING (warning actually seems more appropriate to me)?
  2. Do we want consistency across the various functions in how these are handled? If so, is this the right approach to achieve this?
coveralls commented 7 years ago

Coverage Status

Coverage decreased (-0.2%) to 52.314% when pulling 1eb71991e1f6c16cbee36985e0b23a57d1749f87 on fix-expression-errors into ea31881381e592cd26f6125a4723ccd74796ec0e on master.

tavinathanson commented 7 years ago

@jburos fair points. Since we have different use cases, and I want to make sure people don't inadvertently have missing data because they misspelled a BAM filename, how about we gate this on a new Cohort.fail_on_missing_bams that defaults to True?

jburos commented 7 years ago

@tavinathanson then, why not do the same for VCFs, etc? what about an allow_nan which defaults to False? If True, then NaN values in any of these fields would be "OK"

tavinathanson commented 7 years ago

@jburos I was just worried about confusion with clinical data, since we do allow that to be NaN.

jburos commented 7 years ago

@tavinathanson First, we should probably separate the issue of allowing NaN for variants but not expressed variants, vs the name of the parameter ;) my fault for confounding this.

I care less about the parameter name & more about consistency across summary-functions. Even if the flag were fail_on_missing_bams, shouldn't we also fail on missing vcfs? Or would this be a separate flag?

Regardless, was thinking when I wrote the aside that this would be nice to customize at the level of the analysis (ie within plot_benefit) as well as a cohort-wide default.

I was also thinking (one could argue) that this should apply also to clinical data -- ie fine for age to be missing for some patients until we are using it in an analysis.

Finally, I should probably separate the issues of a misspelled bamfile name from that of a patient not having a bamfile (i should rename the exception to reflect this). I'm pretty sure the misspelled-bam-file case will result in a different error at the moment.

jburos commented 7 years ago

@tavinathanson also FYI (no rush) I'm done making changes to this one, for whenever you are available to take a look. We should really add some test cases to this -- going to create an issue for that.

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.2%) to 52.704% when pulling de9301d025d3893f7f81cea62f81b0a55069dfab on fix-expression-errors into ea31881381e592cd26f6125a4723ccd74796ec0e on master.

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.2%) to 52.704% when pulling b7c813924dde3cdd17fcbe2417e5e671b8abce25 on fix-expression-errors into ea31881381e592cd26f6125a4723ccd74796ec0e on master.

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.1%) to 52.856% when pulling 3c5e86a9c732e6ddb7fa2d6d1ae0bbf0b7ced22a on fix-expression-errors into f9f4af2384760a656acadbfd74c5ccd8fade522a on master.

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.1%) to 53.019% when pulling 4a26b1456872958fa9dd98a8845694adceb44bfc on fix-expression-errors into 5ed28a4d54c0646cd01acf47aa55ad30c7d61ac8 on master.