Open mtekman opened 11 months ago
This error still persists in v0.3.12
I have no idea why I'm the only one getting this. I've been poking through the code, and I refuse to believe that the error is to do with a faulty h5py or cooler library.
After much hair pulling, I think I've found the issue - though not the cause yet:
in lib/python3.10/site-packages/cooler/create/_create.py
(write_bins), the value of my chromnames
variable is:
{
1: 0,
10: 1,
11: 2,
12: 3,
13: 4,
14: 5,
15: 6,
16: 7,
17: 8,
18: 9,
19: 10,
2: 11,
3: 12,
4: 13,
5: 14,
6: 15,
7: 16,
8: 17,
9: 18, <------------------- Yes, okay makes sense
'9': 19, <----------------- Wait, what?
'GL456210.1': 20,
'GL456211.1': 21,
'GL456212.1': 22,
'GL456213.1': 23,
'GL456216.1': 24,
'GL456219.1': 25,
'GL456221.1': 26,
'GL456233.1': 27,
'GL456239.1': 28,
'GL456350.1': 29,
'GL456354.1': 30,
'GL456359.1': 31,
'GL456360.1': 32,
'GL456366.1': 33,
'GL456367.1': 34,
'GL456368.1': 35,
'GL456370.1': 36,
'GL456372.1': 37,
'GL456378.1': 38,
'GL456379.1': 39,
'GL456381.1': 40,
'GL456382.1': 41,
'GL456383.1': 42,
'GL456385.1': 43,
'GL456387.1': 44,
'GL456389.1': 45,
'GL456390.1': 46,
'GL456392.1': 47,
'GL456393.1': 48,
'GL456394.1': 49,
'GL456396.1': 50,
'JH584292.1': 51,
'JH584293.1': 52,
'JH584294.1': 53,
'JH584295.1': 54,
'JH584296.1': 55,
'JH584297.1': 56,
'JH584298.1': 57,
'JH584299.1': 58,
'JH584300.1': 59,
'JH584301.1': 60,
'JH584302.1': 61,
'JH584303.1': 62,
'JH584304.1': 63,
'MT': 64,
'X': 65,
'Y': 66
}
That is: I appear to have two chromosome 9's. I'm working on finding out where exactly this occurs... stay tuned
Investigating the source of the two chromosome 9's
If I look at cooler/create/_create.py: (write-chroms)
, variable chroms
, I get:
name length
0 1 195471971
·
17 8 129401213
18 9 113249638
19 9 124595110
·
54 JH584295.1 1976
·
64 MT 16299
and if I look at my Mus_musculus.GRCm38.dna.primary_assembly.fa.sizes
fed into the pipeline, I see:
1 195471971
·
8 129401213
9 124595110
MT 16299
·
JH584295.1 1976
so, one of the chromosome 9's is correct in terms of the size stated, and the other size appears to just be pulled out of thin air...
Worked it out: It's an issue with Pandas not parsing large dataframes for the correct column dtypes.
You can reproduce the same problem by:
import pandas as pd
import numpy as np
## Create a dataframe with n rows
n = 150000
pd.DataFrame(
np.random.randint(0,100, size=(n, 4)), columns=list('ABCD')
).join(pd.DataFrame({'chrom': ([9] * (n-10)) + (['X'] * 10)})
).to_csv("/tmp/mytab.csv")
Produces a table with n rows, containing only two chromosomes: 9 and X
| | A | B | C | D | chrom |
|--------+----+----+----+----+-------|
| 0 | 69 | 18 | 69 | 99 | 9 |
| 1 | 59 | 60 | 71 | 33 | 9 |
| · | · | · | · | · | · |
| · | · | · | · | · | · |
| 149998 | 78 | 0 | 21 | 13 | 9 |
| 149999 | 81 | 15 | 99 | 9 | 9 |
| 150000 | 89 | 4 | 34 | 44 | X |
| 150001 | 70 | 10 | 80 | 52 | X |
| · | · | · | · | · | · |
| · | · | · | · | · | · |
| 150009 | 76 | 70 | 11 | 40 | X |
tab = read.csv("/tmp/mytab.csv")
## - How many chromosomes?
unique(tab[["chrom"]]) ## Produces: "9" "X"
import pandas as pd
tab = pd.read_csv("/tmp/mytab.csv")
## (Warning: <stdin>:1: DtypeWarning: Columns (5) have mixed types. Specify dtype option on import or set low_memory=False.)
## - How many chromosomes?
set(tab["chrom"].values) ## Produces: {'9', 9, 'X'}
## - Are you sure? Let's cast chrom to categorical values
set(tab["chrom"].astype("category").values) ## *Still* produces: {'9', 9, 'X'}
## - Are you *super* sure? Let's cast chrom to string
set(tab["chrom"].astype("string").values) ## Finally produces: {'9', 'X'}
Note: For n = 130000, this problem does not occur.
So, somewhere in capcruncher a pandas dataframe needs to be read in with the correct dtypes, and maybe force all chromosome data to be string.
Describe the bug
An H5 file fails to be built from the sample parquet file.
To Reproduce
fails with:
Workaround
This can be fixed by patching the cooler library, though I'm not sure if it's correct to, and I'm sure the error occurs elsewhere upstream. This is the fix that I implemented though:
In
~/micromamba/envs/cc/lib/python3.10/site-packages/cooler/create/_create.py
, change lines 110-127 from:to: