Open drewjbeh opened 3 years ago
Hey! Can you maybe paste the dataspec.yaml here and make sure the chromosome names are consistent across bigwig/bed/fasta files?
Hi! Thanks for the response. I have confirmed that chromosome names are consistent in all inputs, as well as in the chromosome size file used to generate bigwig files from bedGraph files. For example, I could find exact matches for all unique chromosome names in the peak files in the bedGraph file as well as in the genome fasta and the chromosome sizes file. One thing to note is that I am using the Gencode genome reference which has a slightly different naming convention for unassembled contigs (for example chr1_KI270706v1_random in the UCSC reference is KI270706.1 in the Gencode reference). However, the different naming is used consistently in all my input files. Not sure if this is somehow an issue, for example using premade models (bpnet9) or some other code that tries to match to a different reference automatically, although I couldn't tell that this would be an issue. Here is the full dataspec yaml with all tasks, and one set of input ChIP data for bias tracks:
fasta_file: /data/genome_misc/hg38/Homo_sapiens.GRCh38.primary_assembly.genome.fa
task_specs: # specifies tasks for each cell type
kiPS: # task name
tracks:
# List of bigwig files (1 or more) corresponding to the task
# The model will predict each track individually (here coverage of
# reads mapping to the positive and negative strand x 2 replicates) and
# the contribution scores will be averaged across all of these tracks
- inputs/lib973_kwt-iPSC_Brf1_pre_1_multi2_keepPrimary_dupRem.neg.bw
- inputs/lib973_kwt-iPSC_Brf1_pre_1_multi2_keepPrimary_dupRem.pos.bw
peaks: /data/analysis/20201022_ChIP_ATAC_masterAnalysis/human/Brf1/macs/blacklist_filter/lib973_kwt-iPSC_Brf1_pre_1_multi2_keepPrimary_dupRem.bam_summits_blacklistFilt.bed
kNPC:
tracks:
- inputs/lib975_kwt-NPC_Brf1_pre_1_multi2_keepPrimary_dupRem.neg.bw
- inputs/lib975_kwt-NPC_Brf1_pre_1_multi2_keepPrimary_dupRem.pos.bw
peaks: /data/analysis/20201022_ChIP_ATAC_masterAnalysis/human/Brf1/macs/blacklist_filter/lib975_kwt-NPC_Brf1_pre_1_multi2_keepPrimary_dupRem.bam_summits_blacklistFilt.bed
kMN:
tracks:
- inputs/lib977_kwt-MN_Brf1_pre_1_multi2_keepPrimary_dupRem.neg.bw
- inputs/lib977_kwt-MN_Brf1_pre_1_multi2_keepPrimary_dupRem.pos.bw
peaks: /data/analysis/20201022_ChIP_ATAC_masterAnalysis/human/Brf1/macs/blacklist_filter/lib977_kwt-MN_Brf1_pre_1_multi2_keepPrimary_dupRem.bam_summits_blacklistFilt.bed
kmDA:
tracks:
- inputs/lib979_kwt-mDa_Brf1_pre_1_multi2_keepPrimary_dupRem.neg.bw
- inputs/lib979_kwt-mDa_Brf1_pre_1_multi2_keepPrimary_dupRem.pos.bw
peaks: /data/analysis/20201022_ChIP_ATAC_masterAnalysis/human/Brf1/macs/blacklist_filter/lib979_kwt-mDa_Brf1_pre_1_multi2_keepPrimary_dupRem.bam_summits_blacklistFilt.bed
kCM:
tracks:
- inputs/lib981_kwt-CM_Brf1_pre_1_multi2_keepPrimary_dupRem.neg.bw
- inputs/lib981_kwt-CM_Brf1_pre_1_multi2_keepPrimary_dupRem.pos.bw
peaks: /data/analysis/20201022_ChIP_ATAC_masterAnalysis/human/Brf1/macs/blacklist_filter/lib981_kwt-CM_Brf1_pre_1_multi2_keepPrimary_dupRem.bam_summits_blacklistFilt.bed
bias_specs: # bias tracks for each cell type
kips_input:
tracks:
- inputs/lib404_hCh_kips_input_1_multi2_keepPrimary_dupRem.neg.bw
- inputs/lib404_hCh_kips_input_1_multi2_keepPrimary_dupRem.pos.bw
tasks:
- kiPS
- kNPC
- kMN
- kmDA
- kCM
However, I have tried with a reduced dataspec as well (also moved the peak bed file into the inputs directory and zipped, just in case that was somehow an issue):
task_specs: # specifies tasks for each cell type
kNPC:
tracks:
- inputs/lib975_kwt-NPC_Brf1_pre_1_multi2_keepPrimary_dupRem.neg.bw
- inputs/lib975_kwt-NPC_Brf1_pre_1_multi2_keepPrimary_dupRem.pos.bw
peaks: inputs/lib975_kwt-NPC_Brf1_pre_1_multi2_keepPrimary_dupRem.bam_summits_blacklistFilt2.bed.gz
bias_specs: # bias tracks for each cell type
knp_input:
tracks:
- inputs/lib405_hCh_knp_input_1_multi2_keepPrimary_dupRem.neg.bw
- inputs/lib405_hCh_knp_input_1_multi2_keepPrimary_dupRem.pos.bw
tasks:
- kNPC
Try to debug the dataset by loading the bpnet.datasets.StrandedProfile using the file spec. You can obtain dataspec ds
by running ds = bpnet.dataspecs.DataSpec.load_yaml(path)
(not sure if the method is called load_yaml, but it should something be along these lines).
Also, maybe try switching to absolute paths (although I doubt this will help).
I think I know what could possibly be wrong: Note that you should also adjust the held-out chromosomes here.
So I can successfully load the yaml with bpnet.dataspecs.DataSpec.load(path)
, but when I try bpnet.datasets.StrandedProfile(ds)
I get the following error. Seems to be a problem when mapping the chromosome lengths for specific data. This seems to return NA/Inf values that pandas cannot convert to int? When I manually run bpnet.extractors._chrom_sizes('path/to/genome.fa')
I get an OrderedDict of chromosome lengths that looks complete (no missing or NA values). I will try to pin down exactly what is causing the issue unless you may already know what is going on? Thanks!
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-19-e78484ebcc78> in <module>
----> 1 bpnet.datasets.StrandedProfile(ds)
~/anaconda3/envs/bpnet/lib/python3.6/site-packages/gin/config.py in gin_wrapper(*args, **kwargs)
1067 scope_info = " in scope '{}'".format(scope_str) if scope_str else ''
1068 err_str = err_str.format(name, fn_or_cls, scope_info)
-> 1069 utils.augment_exception_message_and_reraise(e, err_str)
1070
1071 return gin_wrapper
~/anaconda3/envs/bpnet/lib/python3.6/site-packages/gin/utils.py in augment_exception_message_and_reraise(exception, message)
39 proxy = ExceptionProxy()
40 ExceptionProxy.__qualname__ = type(exception).__qualname__
---> 41 raise proxy.with_traceback(exception.__traceback__) from None
42
43
~/anaconda3/envs/bpnet/lib/python3.6/site-packages/gin/config.py in gin_wrapper(*args, **kwargs)
1044
1045 try:
-> 1046 return fn(*new_args, **new_kwargs)
1047 except Exception as e: # pylint: disable=broad-except
1048 err_str = ''
~/anaconda3/envs/bpnet/lib/python3.6/site-packages/bpnet/datasets.py in __init__(self, ds, peak_width, seq_width, incl_chromosomes, excl_chromosomes, intervals_file, intervals_format, include_metadata, tasks, include_classes, shuffle, interval_transformer, track_transform, total_count_transform)
276 resize_width=max(self.peak_width, self.seq_width)
277 ).df.iloc[:, :3].assign(task=task)
--> 278 for task, task_spec in self.ds.task_specs.items()
279 if task_spec.peaks is not None])
280 assert list(self.dfm.columns)[:4] == [0, 1, 2, "task"]
~/anaconda3/envs/bpnet/lib/python3.6/site-packages/bpnet/datasets.py in <listcomp>(.0)
277 ).df.iloc[:, :3].assign(task=task)
278 for task, task_spec in self.ds.task_specs.items()
--> 279 if task_spec.peaks is not None])
280 assert list(self.dfm.columns)[:4] == [0, 1, 2, "task"]
281 if self.shuffle:
~/anaconda3/envs/bpnet/lib/python3.6/site-packages/bpnet/datasets.py in __init__(self, tsv_file, num_chr, label_dtype, mask_ambigous, incl_chromosomes, excl_chromosomes, chromosome_lens, resize_width)
130 center = (self.df[1] + self.df[2]) // 2
131 valid_seqs = ((center > self.resize_width // 2 + 1) &
--> 132 (center < self.df[0].map(chromosome_lens).astype(int) - self.resize_width // 2 - 1))
133 self.df = self.df[valid_seqs]
134
~/anaconda3/envs/bpnet/lib/python3.6/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
5546 else:
5547 # else, only a single dtype is given
-> 5548 new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors,)
5549 return self._constructor(new_data).__finalize__(self, method="astype")
5550
~/anaconda3/envs/bpnet/lib/python3.6/site-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
602 self, dtype, copy: bool = False, errors: str = "raise"
603 ) -> "BlockManager":
--> 604 return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
605
606 def convert(
~/anaconda3/envs/bpnet/lib/python3.6/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, **kwargs)
407 applied = b.apply(f, **kwargs)
408 else:
--> 409 applied = getattr(b, f)(**kwargs)
410 result_blocks = _extend_blocks(applied, result_blocks)
411
~/anaconda3/envs/bpnet/lib/python3.6/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
593 vals1d = values.ravel()
594 try:
--> 595 values = astype_nansafe(vals1d, dtype, copy=True)
596 except (ValueError, TypeError):
597 # e.g. astype_nansafe can fail on object-dtype of strings
~/anaconda3/envs/bpnet/lib/python3.6/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
966
967 if not np.isfinite(arr).all():
--> 968 raise ValueError("Cannot convert non-finite values (NA or inf) to integer")
969
970 elif is_object_dtype(arr):
ValueError: Cannot convert non-finite values (NA or inf) to integer
In call to configurable 'StrandedProfile' (<class 'bpnet.datasets.StrandedProfile'>)
I did some digging and it seems the issue is coming from the TsvReader
class in the bpnet.datasets
module. There seems to be some replacing of "chr" with empty strings, or adding "chr" to the front of chromosome names where absent (over here). As I mentioned, the genome version I'm using from Gencode has the "chr1, chr2..." naming scheme in combination with "GL000008.2" naming. The code linked above checks the first item which happens to be one of these unassembled contigs, which doesn't have a "chr" in front. This means all chromosomes get a "chr" producing a self.df[0]
that looks something like this:
0 chrGL000008.2
1 chrGL000008.2
2 chrGL000008.2
3 chrGL000008.2
4 chrGL000008.2
...
4666 chrchrY
4667 chrchrY
4668 chrchrY
4669 chrchrY
4670 chrchrY
Hi there,
I am currently trying to train BPNet on some human ChIP-seq data we have for an RNA Pol III transcription factor. We have multiple cell lines that I would like to use as tasks for the model. I guess it's important to note that Pol III has very few targets in the human genome (and so does this TF we are working with, most of which overlap with Pol III targets), which mostly include tRNA genes, 5S rRNA and a few others. In total we get around 400 peaks in our stem cell model (similar to what others have seen), and we see great resolution and enrichment at expected sites, so we are sure the data is good.
I have set up BPNet in a conda environment and am running it from a notebook using that environment as the kernel. I am able to train on the test chip-seq data packaged with BPNet using this setup. However, when I try to use our data I get the following error:
I manually edited that
data_utils.py
script from kipoi to print thebatch
variable. Turns out its an empty list, hence theIndexError
. While with the BPNet test data it is correctly filled with one-hot data for the sequences and count data. I have checked the input data and can't find anything obviously wrong, it all looks quite similar to the BPNet data. I generated the stranded bigwig files as per the FAQ section of the example notebook, and summit files come from macs2. Any help would be much appreciated!Drew