kundajelab / bpnet

Toolkit to train base-resolution deep neural networks on functional genomics data and to interpret them
http://bit.ly/bpnet-colab
MIT License

Empty batch error loading training data into memory #33

Open drewjbeh opened 2 years ago

drewjbeh commented 2 years ago

Hi there,

I am currently trying to train BPNet on human ChIP-seq data we have for an RNA Pol III transcription factor. We have multiple cell lines that I would like to use as tasks for the model. It is probably important to note that Pol III has very few targets in the human genome (and so does the TF we are working with; most of its targets overlap with Pol III targets), mostly tRNA genes, 5S rRNA and a few others. In total we get around 400 peaks in our stem cell model (similar to what others have seen), and we see good resolution and enrichment at the expected sites, so we are confident the data is good.

I have set up BPNet in a conda environment and am running it from a notebook using that environment as the kernel. I am able to train on the test ChIP-seq data packaged with BPNet using this setup. However, when I try to use our own data I get the following error:

0it [00:00, ?it/s]0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/home/drew/anaconda3/envs/bpnet/bin/bpnet", line 8, in <module>
    sys.exit(main())
  File "/home/drew/anaconda3/envs/bpnet/lib/python3.6/site-packages/bpnet/__main__.py", line 38, in main
    argh.dispatch(parser)
  File "/home/drew/anaconda3/envs/bpnet/lib/python3.6/site-packages/argh/dispatching.py", line 174, in dispatch
    for line in lines:
  File "/home/drew/anaconda3/envs/bpnet/lib/python3.6/site-packages/argh/dispatching.py", line 277, in _execute_command
    for line in result:
  File "/home/drew/anaconda3/envs/bpnet/lib/python3.6/site-packages/argh/dispatching.py", line 260, in _call
    result = function(*positional, **keywords)
  File "/home/drew/anaconda3/envs/bpnet/lib/python3.6/site-packages/bpnet/cli/train.py", line 698, in bpnet_train
    gpu=gpu)
  File "/home/drew/anaconda3/envs/bpnet/lib/python3.6/site-packages/gin/config.py", line 1069, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/home/drew/anaconda3/envs/bpnet/lib/python3.6/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/home/drew/anaconda3/envs/bpnet/lib/python3.6/site-packages/gin/config.py", line 1046, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/home/drew/anaconda3/envs/bpnet/lib/python3.6/site-packages/bpnet/cli/train.py", line 410, in train
    num_workers=num_workers))
  File "/home/drew/anaconda3/envs/bpnet/lib/python3.6/site-packages/kipoi/data.py", line 408, in load_all
    return numpy_collate_concat([x for x in tqdm(self.batch_iter(batch_size, **kwargs))])
  File "/home/drew/anaconda3/envs/bpnet/lib/python3.6/site-packages/kipoi_utils/data_utils.py", line 21, in numpy_collate_fn
    if type(batch[0]).__module__ == 'numpy':
IndexError: list index out of range

I manually edited that data_utils.py script from kipoi to print the batch variable, and it turns out it's an empty list, hence the IndexError, whereas with the BPNet test data it is correctly filled with one-hot sequence data and count data. I have checked our input data and can't find anything obviously wrong; it all looks quite similar to the BPNet test data. I generated the stranded bigwig files as per the FAQ section of the example notebook, and the summit files come from macs2. Any help would be much appreciated!
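For context, the failure mode can be illustrated in isolation: kipoi's collate helper inspects batch[0] to decide how to stack the batch, so an empty batch surfaces as this bare IndexError rather than a more descriptive message (a toy illustration only, not BPNet code):

# Toy illustration (not BPNet code): kipoi's numpy_collate_fn inspects
# batch[0] to decide how to stack values, so an empty batch list raises
# IndexError before any more informative check can run.
batch = []  # what the data loader yielded: no valid training intervals
if type(batch[0]).__module__ == 'numpy':  # -> IndexError: list index out of range
    pass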

Drew

Avsecz commented 2 years ago

Hey! Can you maybe paste the dataspec.yaml here and make sure the chromosome names are consistent across bigwig/bed/fasta files?

drewjbeh commented 2 years ago

Hi! Thanks for the response. I have confirmed that the chromosome names are consistent across all inputs, as well as in the chromosome sizes file used to generate the bigwig files from bedGraph files. For example, every unique chromosome name in the peak files has an exact match in the bedGraph files, the genome fasta and the chromosome sizes file (a sketch of this kind of cross-check is included after the dataspecs below).

One thing to note is that I am using the Gencode genome reference, which has a slightly different naming convention for unassembled contigs (for example, chr1_KI270706v1_random in the UCSC reference is KI270706.1 in the Gencode reference). However, this naming is used consistently in all my input files. I am not sure whether this could matter, for example if the premade models (bpnet9) or some other code try to match chromosome names against a different reference automatically, but I couldn't see that this would be the case. Here is the full dataspec yaml with all tasks, and one set of input ChIP data for the bias tracks:

fasta_file: /data/genome_misc/hg38/Homo_sapiens.GRCh38.primary_assembly.genome.fa
task_specs: # specifies tasks for each cell type
  kiPS: # task name
    tracks:
      # List of bigwig files (1 or more) corresponding to the task
      # The model will predict each track individually (here coverage of
      # reads mapping to the positive and negative strand x 2 replicates) and
      # the contribution scores will be averaged across all of these tracks
      - inputs/lib973_kwt-iPSC_Brf1_pre_1_multi2_keepPrimary_dupRem.neg.bw
      - inputs/lib973_kwt-iPSC_Brf1_pre_1_multi2_keepPrimary_dupRem.pos.bw
    peaks: /data/analysis/20201022_ChIP_ATAC_masterAnalysis/human/Brf1/macs/blacklist_filter/lib973_kwt-iPSC_Brf1_pre_1_multi2_keepPrimary_dupRem.bam_summits_blacklistFilt.bed
  kNPC:
    tracks:
      - inputs/lib975_kwt-NPC_Brf1_pre_1_multi2_keepPrimary_dupRem.neg.bw
      - inputs/lib975_kwt-NPC_Brf1_pre_1_multi2_keepPrimary_dupRem.pos.bw
    peaks: /data/analysis/20201022_ChIP_ATAC_masterAnalysis/human/Brf1/macs/blacklist_filter/lib975_kwt-NPC_Brf1_pre_1_multi2_keepPrimary_dupRem.bam_summits_blacklistFilt.bed
  kMN:
    tracks:
      - inputs/lib977_kwt-MN_Brf1_pre_1_multi2_keepPrimary_dupRem.neg.bw
      - inputs/lib977_kwt-MN_Brf1_pre_1_multi2_keepPrimary_dupRem.pos.bw
    peaks: /data/analysis/20201022_ChIP_ATAC_masterAnalysis/human/Brf1/macs/blacklist_filter/lib977_kwt-MN_Brf1_pre_1_multi2_keepPrimary_dupRem.bam_summits_blacklistFilt.bed
  kmDA:
    tracks:
      - inputs/lib979_kwt-mDa_Brf1_pre_1_multi2_keepPrimary_dupRem.neg.bw
      - inputs/lib979_kwt-mDa_Brf1_pre_1_multi2_keepPrimary_dupRem.pos.bw
    peaks: /data/analysis/20201022_ChIP_ATAC_masterAnalysis/human/Brf1/macs/blacklist_filter/lib979_kwt-mDa_Brf1_pre_1_multi2_keepPrimary_dupRem.bam_summits_blacklistFilt.bed
  kCM:
    tracks:
      - inputs/lib981_kwt-CM_Brf1_pre_1_multi2_keepPrimary_dupRem.neg.bw
      - inputs/lib981_kwt-CM_Brf1_pre_1_multi2_keepPrimary_dupRem.pos.bw
    peaks: /data/analysis/20201022_ChIP_ATAC_masterAnalysis/human/Brf1/macs/blacklist_filter/lib981_kwt-CM_Brf1_pre_1_multi2_keepPrimary_dupRem.bam_summits_blacklistFilt.bed

bias_specs: # bias tracks for each cell type
  kips_input:
    tracks:
      - inputs/lib404_hCh_kips_input_1_multi2_keepPrimary_dupRem.neg.bw
      - inputs/lib404_hCh_kips_input_1_multi2_keepPrimary_dupRem.pos.bw
    tasks:
      - kiPS
      - kNPC
      - kMN
      - kmDA
      - kCM

However, I have also tried with a reduced dataspec (and moved the peak bed file into the inputs directory and gzipped it, just in case that was somehow the issue):

task_specs: # specifies tasks for each cell type
  kNPC:
    tracks:
      - inputs/lib975_kwt-NPC_Brf1_pre_1_multi2_keepPrimary_dupRem.neg.bw
      - inputs/lib975_kwt-NPC_Brf1_pre_1_multi2_keepPrimary_dupRem.pos.bw
    peaks: inputs/lib975_kwt-NPC_Brf1_pre_1_multi2_keepPrimary_dupRem.bam_summits_blacklistFilt2.bed.gz

bias_specs: # bias tracks for each cell type
  knp_input:
    tracks:
      - inputs/lib405_hCh_knp_input_1_multi2_keepPrimary_dupRem.neg.bw
      - inputs/lib405_hCh_knp_input_1_multi2_keepPrimary_dupRem.pos.bw
    tasks:
      - kNPC
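For reference, the kind of chromosome-name cross-check described above might look roughly like this (a minimal sketch; the file names are placeholders for the inputs listed in the dataspecs, and pyBigWig is assumed to be installed alongside BPNet):

# Sketch of a chromosome-name consistency check across bed, bigwig and fasta.
# File names are placeholders; bpnet.extractors._chrom_sizes is the same
# helper mentioned later in this thread.
import pandas as pd
import pyBigWig  # assumed to be available in the BPNet environment
from bpnet.extractors import _chrom_sizes

fasta_chroms = set(_chrom_sizes("Homo_sapiens.GRCh38.primary_assembly.genome.fa"))
peak_chroms = set(pd.read_csv("summits_blacklistFilt.bed.gz", sep="\t", header=None)[0])
bw_chroms = set(pyBigWig.open("sample.pos.bw").chroms())

print("peak chroms missing from fasta:", peak_chroms - fasta_chroms)
print("peak chroms missing from bigwig:", peak_chroms - bw_chroms)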

Avsecz commented 2 years ago

Try to debug the dataset by loading bpnet.datasets.StrandedProfile using the dataspec. You can obtain the dataspec ds by running ds = bpnet.dataspecs.DataSpec.load_yaml(path) (not sure if the method is called load_yaml, but it should be something along those lines).

Also, maybe try switching to absolute paths (although I doubt this will help).

I think I know what could possibly be wrong: Note that you should also adjust the held-out chromosomes here.
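A minimal sketch of what these suggestions could look like in code (paths and chromosome lists are placeholders; DataSpec.load is the loader that worked in the follow-up below, and incl_chromosomes/excl_chromosomes are the dataset-level arguments visible in StrandedProfile's signature):

# Sketch of the suggested debugging steps; "dataspec.yml" and the chromosome
# lists are placeholders, and the len() call assumes the usual kipoi Dataset
# interface.
from bpnet.dataspecs import DataSpec
from bpnet.datasets import StrandedProfile

ds = DataSpec.load("dataspec.yml")   # parse the dataspec yaml
dataset = StrandedProfile(ds)        # build the dataset from it
print(len(dataset))                  # 0 here would explain the empty batch

# Held-out chromosomes are controlled by these dataset-level arguments; the
# names must match the convention used in the fasta/bed/bigwig files.
valid_set = StrandedProfile(ds, incl_chromosomes=["chr2", "chr3", "chr4"])
train_set = StrandedProfile(ds, excl_chromosomes=["chr2", "chr3", "chr4"])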

drewjbeh commented 2 years ago

So I can successfully load the yaml with bpnet.dataspecs.DataSpec.load(path), but when I try bpnet.datasets.StrandedProfile(ds) I get the error below. It seems to be a problem when mapping the chromosome lengths onto the peak data: the mapping apparently returns NA/Inf values that pandas cannot convert to int. When I manually run bpnet.extractors._chrom_sizes('path/to/genome.fa') I get an OrderedDict of chromosome lengths that looks complete (no missing or NA values). I will try to pin down exactly what is causing this, unless you already know what is going on? Thanks!

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-19-e78484ebcc78> in <module>
----> 1 bpnet.datasets.StrandedProfile(ds)

~/anaconda3/envs/bpnet/lib/python3.6/site-packages/gin/config.py in gin_wrapper(*args, **kwargs)
   1067       scope_info = " in scope '{}'".format(scope_str) if scope_str else ''
   1068       err_str = err_str.format(name, fn_or_cls, scope_info)
-> 1069       utils.augment_exception_message_and_reraise(e, err_str)
   1070 
   1071   return gin_wrapper

~/anaconda3/envs/bpnet/lib/python3.6/site-packages/gin/utils.py in augment_exception_message_and_reraise(exception, message)
     39   proxy = ExceptionProxy()
     40   ExceptionProxy.__qualname__ = type(exception).__qualname__
---> 41   raise proxy.with_traceback(exception.__traceback__) from None
     42 
     43 

~/anaconda3/envs/bpnet/lib/python3.6/site-packages/gin/config.py in gin_wrapper(*args, **kwargs)
   1044 
   1045     try:
-> 1046       return fn(*new_args, **new_kwargs)
   1047     except Exception as e:  # pylint: disable=broad-except
   1048       err_str = ''

~/anaconda3/envs/bpnet/lib/python3.6/site-packages/bpnet/datasets.py in __init__(self, ds, peak_width, seq_width, incl_chromosomes, excl_chromosomes, intervals_file, intervals_format, include_metadata, tasks, include_classes, shuffle, interval_transformer, track_transform, total_count_transform)
    276                                             resize_width=max(self.peak_width, self.seq_width)
    277                                             ).df.iloc[:, :3].assign(task=task)
--> 278                                   for task, task_spec in self.ds.task_specs.items()
    279                                   if task_spec.peaks is not None])
    280             assert list(self.dfm.columns)[:4] == [0, 1, 2, "task"]

~/anaconda3/envs/bpnet/lib/python3.6/site-packages/bpnet/datasets.py in <listcomp>(.0)
    277                                             ).df.iloc[:, :3].assign(task=task)
    278                                   for task, task_spec in self.ds.task_specs.items()
--> 279                                   if task_spec.peaks is not None])
    280             assert list(self.dfm.columns)[:4] == [0, 1, 2, "task"]
    281             if self.shuffle:

~/anaconda3/envs/bpnet/lib/python3.6/site-packages/bpnet/datasets.py in __init__(self, tsv_file, num_chr, label_dtype, mask_ambigous, incl_chromosomes, excl_chromosomes, chromosome_lens, resize_width)
    130             center = (self.df[1] + self.df[2]) // 2
    131             valid_seqs = ((center > self.resize_width // 2 + 1) &
--> 132                           (center < self.df[0].map(chromosome_lens).astype(int) - self.resize_width // 2 - 1))
    133             self.df = self.df[valid_seqs]
    134 

~/anaconda3/envs/bpnet/lib/python3.6/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5546         else:
   5547             # else, only a single dtype is given
-> 5548             new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors,)
   5549             return self._constructor(new_data).__finalize__(self, method="astype")
   5550 

~/anaconda3/envs/bpnet/lib/python3.6/site-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
    602         self, dtype, copy: bool = False, errors: str = "raise"
    603     ) -> "BlockManager":
--> 604         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    605 
    606     def convert(

~/anaconda3/envs/bpnet/lib/python3.6/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, **kwargs)
    407                 applied = b.apply(f, **kwargs)
    408             else:
--> 409                 applied = getattr(b, f)(**kwargs)
    410             result_blocks = _extend_blocks(applied, result_blocks)
    411 

~/anaconda3/envs/bpnet/lib/python3.6/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
    593             vals1d = values.ravel()
    594             try:
--> 595                 values = astype_nansafe(vals1d, dtype, copy=True)
    596             except (ValueError, TypeError):
    597                 # e.g. astype_nansafe can fail on object-dtype of strings

~/anaconda3/envs/bpnet/lib/python3.6/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
    966 
    967         if not np.isfinite(arr).all():
--> 968             raise ValueError("Cannot convert non-finite values (NA or inf) to integer")
    969 
    970     elif is_object_dtype(arr):

ValueError: Cannot convert non-finite values (NA or inf) to integer
  In call to configurable 'StrandedProfile' (<class 'bpnet.datasets.StrandedProfile'>)
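One way to narrow this down is to reproduce the failing .map() call outside of TsvReader, using the raw peak chromosome names against the fasta-derived lengths; if nothing is missing there, the mismatch must be introduced inside the reader itself (a sketch, with placeholder paths):

# Sketch: reproduce the failing .map() against the raw peak chromosome names.
# If nothing is missing here, the NaN values must come from renaming inside
# TsvReader itself. Paths are placeholders.
import pandas as pd
from bpnet.extractors import _chrom_sizes

chrom_lens = _chrom_sizes("path/to/genome.fa")
peaks = pd.read_csv("path/to/summits_blacklistFilt.bed.gz", sep="\t", header=None)
lengths = peaks[0].map(chrom_lens)
print(peaks[0][lengths.isna()].unique())  # empty => the raw names all match the fasta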

drewjbeh commented 2 years ago

I did some digging and it seems the issue comes from the TsvReader class in the bpnet.datasets module. It either strips "chr" from chromosome names or prepends "chr" where it is absent (over here). As I mentioned, the Gencode genome I'm using combines the "chr1, chr2, ..." naming scheme with names like "GL000008.2" for unassembled contigs. The code linked above checks only the first entry, which happens to be one of these contigs without a "chr" prefix, so "chr" gets prepended to every chromosome name, producing a self.df[0] that looks something like this:

0       chrGL000008.2
1       chrGL000008.2
2       chrGL000008.2
3       chrGL000008.2
4       chrGL000008.2
            ...      
4666          chrchrY
4667          chrchrY
4668          chrchrY
4669          chrchrY
4670          chrchrY
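Given that, one possible workaround (a sketch only, with placeholder file names) would be to drop the non-"chr" contigs from the summit BED files before training, so that TsvReader's first-row check no longer prepends a second "chr":

# Sketch of a possible workaround: keep only summits on "chr"-prefixed contigs
# so TsvReader's first-row heuristic does not prepend another "chr".
# File names are placeholders.
import pandas as pd

summits = pd.read_csv("summits_blacklistFilt.bed.gz", sep="\t", header=None)
summits = summits[summits[0].str.startswith("chr")]
summits.to_csv("summits_mainChroms.bed.gz", sep="\t", header=False, index=False)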