h5py TypeError '<' not supported between instances of 'str' and 'int'

Describe the bug

An H5 file fails to be built from the sample parquet file.

To Reproduce

capcruncher interactions count \
      results/<sample>/<sample>.parquet \
 -o interim/pileups/counts_by_restriction_fragment/<sample>.hdf5 \
 -f  resources/restriction_fragments/genome.digest.bed.gz \
 -v <viewpoints.bed> -p 5 --assay capture

fails with:

*** TypeError: '<' not supported between instances of 'str' and 'int'

Workaround

This can be fixed by patching the cooler library, though I'm not sure if it's correct to, and I'm sure the error occurs elsewhere upstream. This is the fix that I implemented though:

In ~/micromamba/envs/cc/lib/python3.10/site-packages/cooler/create/_create.py, change lines 110-127 from:

    # Store bins
    try:
        chrom_dset = grp.create_dataset(
            "chrom", shape=(n_bins,), dtype=chrom_dtype, data=chrom_ids, **h5opts
        )
    except ValueError:
        # If too many scaffolds for HDF5 enum header,
        # try storing chrom IDs as raw int instead
        if chrom_as_enum:
            chrom_as_enum = False
            chrom_dtype = CHROMID_DTYPE
            chrom_dset = grp.create_dataset(
                "chrom", shape=(n_bins,), dtype=chrom_dtype, data=chrom_ids, **h5opts
            )
        else:
            raise
    if not chrom_as_enum:
        chrom_dset.attrs["enum_path"] = "/chroms/name"

to:

    try:
        chrom_dset = grp.create_dataset(
            "chrom", shape=(n_bins,), dtype=chrom_dtype, data=chrom_ids, **h5opts
        )
    except (ValueError, TypeError):
        # TypeError: '<' not supported between instances of 'str' and 'int'
        # If too many scaffolds for HDF5 enum header,
        # try storing chrom IDs as raw int instead
        if chrom_as_enum:
            chrom_as_enum = False
            chrom_dtype = CHROMID_DTYPE
            chrom_dset = grp.create_dataset(
                "chrom", shape=(n_bins,), dtype=chrom_dtype, data=chrom_ids, **h5opts
            )
        else:
            raise
    if not chrom_as_enum:
        chrom_dset.attrs["enum_path"] = "/chroms/name"

This error still persists in v0.3.12

I have no idea why I'm the only one getting this. I've been poking through the code, and I refuse to believe that the error is to do with a faulty h5py or cooler library.

After much hair pulling, I think I've found the issue - though not the cause yet:

in lib/python3.10/site-packages/cooler/create/_create.py (write_bins), the value of my chromnames variable is:

{
  1: 0,
  10: 1,
  11: 2,
  12: 3,
  13: 4,
  14: 5,
  15: 6,
  16: 7,
  17: 8,
  18: 9,
  19: 10,
  2: 11,
  3: 12,
  4: 13,
  5: 14,
  6: 15,
  7: 16,
  8: 17,
  9: 18, <------------------- Yes, okay makes sense
  '9': 19, <----------------- Wait, what?
  'GL456210.1': 20,
  'GL456211.1': 21,
  'GL456212.1': 22,
  'GL456213.1': 23,
  'GL456216.1': 24,
  'GL456219.1': 25,
  'GL456221.1': 26,
  'GL456233.1': 27,
  'GL456239.1': 28,
  'GL456350.1': 29,
  'GL456354.1': 30,
  'GL456359.1': 31,
  'GL456360.1': 32,
  'GL456366.1': 33,
  'GL456367.1': 34,
  'GL456368.1': 35,
  'GL456370.1': 36,
  'GL456372.1': 37,
  'GL456378.1': 38,
  'GL456379.1': 39,
  'GL456381.1': 40,
  'GL456382.1': 41,
  'GL456383.1': 42,
  'GL456385.1': 43,
  'GL456387.1': 44,
  'GL456389.1': 45,
  'GL456390.1': 46,
  'GL456392.1': 47,
  'GL456393.1': 48,
  'GL456394.1': 49,
  'GL456396.1': 50,
  'JH584292.1': 51,
  'JH584293.1': 52,
  'JH584294.1': 53,
  'JH584295.1': 54,
  'JH584296.1': 55,
  'JH584297.1': 56,
  'JH584298.1': 57,
  'JH584299.1': 58,
  'JH584300.1': 59,
  'JH584301.1': 60,
  'JH584302.1': 61,
  'JH584303.1': 62,
  'JH584304.1': 63,
  'MT': 64,
  'X': 65,
  'Y': 66
}

That is: I appear to have two chromosome 9's. I'm working on finding out where exactly this occurs... stay tuned

Investigating the source of the two chromosome 9's

If I look at cooler/create/_create.py: (write-chroms), variable chroms, I get:

    name    length
0   1   195471971
·
17  8   129401213
18  9   113249638
19  9   124595110
·
54  JH584295.1  1976
·
64  MT  16299

and if I look at my Mus_musculus.GRCm38.dna.primary_assembly.fa.sizes fed into the pipeline, I see:

1   195471971
·
8   129401213
9   124595110
MT  16299
·
JH584295.1  1976

so, one of the chromosome 9's is correct in terms of the size stated, and the other size appears to just be pulled out of thin air...

Worked it out: It's an issue with Pandas not parsing large dataframes for the correct column dtypes.

You can reproduce the same problem by:

Create Genomic Data

import pandas as pd
import numpy as np
## Create a dataframe with n rows
n = 150000
pd.DataFrame(
    np.random.randint(0,100, size=(n, 4)), columns=list('ABCD')
    ).join(pd.DataFrame({'chrom': ([9] * (n-10)) + (['X'] * 10)})
        ).to_csv("/tmp/mytab.csv")

Produces a table with n rows, containing only two chromosomes: 9 and X

|        |  A |  B |  C |  D | chrom |
|--------+----+----+----+----+-------|
|      0 | 69 | 18 | 69 | 99 | 9     |
|      1 | 59 | 60 | 71 | 33 | 9     |
|      · |  · |  · |  · |  · | ·     |
|      · |  · |  · |  · |  · | ·     |
| 149998 | 78 |  0 | 21 | 13 | 9     |
| 149999 | 81 | 15 | 99 |  9 | 9     |
| 150000 | 89 |  4 | 34 | 44 | X     |
| 150001 | 70 | 10 | 80 | 52 | X     |
|      · |  · |  · |  · |  · | ·     |
|      · |  · |  · |  · |  · | ·     |
| 150009 | 76 | 70 | 11 | 40 | X     |

Read Table in R

tab = read.csv("/tmp/mytab.csv")
## - How many chromosomes?
unique(tab[["chrom"]])     ## Produces:  "9" "X"

Read Table in Python (pandas)

import pandas as pd
tab = pd.read_csv("/tmp/mytab.csv")
## (Warning: <stdin>:1: DtypeWarning: Columns (5) have mixed types. Specify dtype option on import or set low_memory=False.)
## - How many chromosomes?
set(tab["chrom"].values)   ## Produces:  {'9', 9, 'X'}
## - Are you sure? Let's cast chrom to categorical values
set(tab["chrom"].astype("category").values)   ## *Still* produces:  {'9', 9, 'X'}
## - Are you *super* sure? Let's cast chrom to string
set(tab["chrom"].astype("string").values)   ## Finally produces:  {'9', 'X'}

Note: For n = 130000, this problem does not occur.

So, somewhere in capcruncher a pandas dataframe needs to be read in with the correct dtypes, and maybe force all chromosome data to be string.

sims-lab / CapCruncher