Fix `MutationMap.query_bed`

an1lam commented 4 years ago

This includes three sub-changes to fix query_bed.

Make it so that null `bed_input` attrs don't trigger exceptions

Since most models in the Kipoi model zoo seem to not have bed_input set (with the exception of two, DeepCPG and one other), it seems wrong for MutationMap to fail on initialization when this value is unset in the config file.

Change _get_dl_bed_fields to return None when a dataloader's postprocessing.variant_effects attribute doesn't include a bed_input field. This will make it so that models like "DeepSEA/variantEffects" can be used with MutationMaps.

The ModelInfoExtractor class currently causes MutationMap's __init__ method to throw an exception if you try to create a MutationMap instance for a model (e.g. "DeepSEA/variantEffects", see here: https://github.com/kipoi/models/blob/master/DeepSEA/variantEffects/model.yaml) with no bed_input attribute under 'postprocessing > variant_effects' in its YAML file. The rest of the subsequent code checks whether exec_files_bed_keys is None and, in general, seems to be able to handle this case, so presumably this failure is unintentional.

One can reproduce this issue with the following code snippet (assumes relevant packages have been installed):

import kipoi
from kipoi_veff import MutationMap

deepsea = kipoi.get_model("DeepSEA/variantEffects", source="kipoi")
# Replace with actual path to hg19
dl_kwargs = {'fasta_file': '/path/to/hg19.fa' }
mm = MutationMap(deepsea, deepsea.default_dataloader, dl_kwargs)
mmp = mm.query_bed('/arbitary/bed/file/path.bed', scores=['logit'])

Only return non-null `bed3_to_vcf_index` when the temp file is non-empty

This code currently returns a temp file pointer for bed3_to_vcf_index even in cases where the file's empty and therefore invalid. This causes problems because, as the following code snippet shows, Python will throw an exception when it tries to read an empty tempfile.

import tempfile
fpath = tempfile.mktemp()
with open(fpath, 'rb') as f:
    pass

On the other hand, as one might expect, the following code runs successfully.

import tempfile
fpath = tempfile.mktemp()

with open(fpath, 'w') as f:
    f.write('foo\n')

with open(fpath, 'rb') as f:
    pass

Enable `query_bed` to be able to be run twice on the same file

Currently, if you try to run query_bed twice with the same file, it will fail with an error that looks like the following.

OSError: Filename '/home/stephenmalina/project/dat/10_random_tf_peaks.bed.gz' already exists, use *force* to overwrite

We want to overwrite the file it's failing for because the overwrite is non-destructive (it just optimizes the compression) and, assuming the original .bed file is uncompressed, will only overwrite a file previously created by a call to tabix.

an1lam commented 4 years ago

Don't merge yet, have a better fix actually.

an1lam commented 4 years ago

@Avsecz there's one more change that I think it makes sense to make: add force=True to the call to the tabix() function here so that running MutationMap.query_bed twice with the same bed file doesn't cause the following error:

OSError: Filename '/home/stephenmalina/project/dat/10_random_tf_peaks.bed.gz' already exists, use *force* to overwrite

Would you prefer a separate PR for this or should I add it here?

It's a tiny change on the same overall code path so I'm inclined to add it here, but I know that's also not ideal Git PR hygiene...

Avsecz commented 4 years ago

That's fine. You can include that fix in the PR here. Just make sure you add a high-level description of all the changes in the PR at the top

an1lam commented 4 years ago

Cool, will do shortly.

an1lam commented 4 years ago

Alright, added the final change and updated the description (first comment) to describe all three changes. Let me know if there's anything else you need from me.

an1lam commented 4 years ago

Also, I've been informally testing this by pulling in my changes and using the get_model and mutation map workflow for something I'm working on. Anything else I can/should do to test? It seems like CI is failing on master, so presumably that's not enough.

Avsecz commented 4 years ago

I think the tests on the master have some issues with the dependencies. Here is my attempt to fix this: https://github.com/kipoi/kipoi-veff/pull/37

an1lam commented 4 years ago

@Avsecz :clap:, success! Thank you!

Avsecz commented 4 years ago

Released to Pypi v0.3.1: https://pypi.org/project/kipoi-veff/0.3.1/.

Thanks for the PR!

kipoi / kipoi-veff