pyranges / pyranges_1.x

PyRanges as a DataFrame subclass.
MIT License

"Chromosome" and "Strand" columns become "object" dtype instead of "category" after `.merge_overlaps()` #37

Open jeanmonet opened 1 month ago

jeanmonet commented 1 month ago

Hi, I've found a dtype discrepancy between the output before and after applying merge_overlaps():

gtfpr.remove_nonloc_columns().dtypes

Chromosome    category
Start            int64
End              int64
Strand        category
dtype: object

However, the Chromosome and Strand columns become object dtype after merge_overlaps():

gtfpr.remove_nonloc_columns().merge_overlaps().dtypes

Chromosome    object
Start          int64
End            int64
Strand        object
dtype: object

Is this expected behavior or is it a bug?
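Until this is fixed, one possible workaround is to cast the affected columns back to category after the call. The sketch below uses a plain pandas DataFrame as a stand-in for the merge_overlaps() result. The helper name restore_categories is hypothetical, and whether astype preserves the PyRanges subclass in pyranges 1.x is an assumption that has not been checked here.

```python
import pandas as pd


def restore_categories(df, columns=("Chromosome", "Strand")):
    """Return a copy of df with the given columns cast to 'category'.

    Hypothetical helper, not part of pyranges. Columns that are absent
    (e.g. Strand on unstranded data) are skipped.
    """
    present = [c for c in columns if c in df.columns]
    return df.astype({c: "category" for c in present})


# Stand-in for the frame returned by merge_overlaps(); a plain DataFrame
# is used here so the example runs without pyranges installed.
merged = pd.DataFrame(
    {
        "Chromosome": ["chr1", "chr1"],
        "Start": [1, 10],
        "End": [5, 20],
        "Strand": ["+", "-"],
    }
)

fixed = restore_categories(merged)
print(fixed.dtypes)
```

Only the listed columns are touched, so Start and End keep their integer dtype.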


In addition, using .join_ranges with join_type="left" produces float64 dtypes for Start_b, End_b and other columns, whereas join_type="inner" keeps them at their original int64 dtype:


joined = gtf_ext.join_ranges(fragments, join_type="left")

   Chromosome  Start  End    Start_b  End_b    barcode             count  ucount
0  chr1        55418  65419  56893.0  57061.0  CATGGATTCTTGCAGG-1  3.0    1.0
1  chr1        55418  65419  57033.0  57135.0  TTGTGCGAGTCATTTC-1  1.0    1.0

joined = gtf_ext.join_ranges(fragments, join_type="inner")

   Chromosome  Start  End    Start_b  End_b  barcode             count  ucount
0  chr1        55418  65419  56893    57061  CATGGATTCTTGCAGG-1  3      1
1  chr1        55418  65419  64419    64681  CGCACACAGCGTGCGT-1  1      1
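The float64 columns are consistent with how plain pandas handles missing values: a left join fills unmatched rows with NaN, and an int64 column cannot hold NaN, so it is upcast to float64. The sketch below reproduces this with pandas.merge on toy data. That join_ranges goes through the same mechanism internally is an assumption. Casting to the nullable "Int64" dtype is one way to keep integers alongside missing values.

```python
import pandas as pd

# Toy frames: row "b" on the left has no partner on the right.
left = pd.DataFrame({"key": ["a", "b"], "Start": [1, 2]})
right = pd.DataFrame({"key": ["a"], "Start_b": [10]})

inner = left.merge(right, on="key", how="inner")
left_joined = left.merge(right, on="key", how="left")

print(inner["Start_b"].dtype)        # int64: every row has a match
print(left_joined["Start_b"].dtype)  # float64: the unmatched row holds NaN

# Nullable integer dtype keeps whole numbers and represents the gap as <NA>.
restored = left_joined.astype({"Start_b": "Int64"})
print(restored["Start_b"].dtype)     # Int64
```

If every left-hand row finds a match, no NaN is introduced and the dtype stays int64 even with how="left".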

endrebak commented 1 month ago

Definitely not intended behavior. Will fix when I'm done with my PhD revision.