PyRanges.sort improperly sorts when strand information is present

pyranges / pyranges

Performant Pythonic GenomicRanges

https://pyranges.readthedocs.io

MIT License

445 stars 47 forks source link

PyRanges.sort improperly sorts when strand information is present #209

Open chartl-rancho opened 3 years ago

chartl-rancho commented 3 years ago

The attached file does not sort properly by chromosome start and stop:

tf_locus.bed.gz

>>> import pyranges
>>> dat = pyranges.read_bed('tf_locus.bed.gz')
>>> dat[dat.Chromosome == '8'].sort().Start
0      38150980
1      38150984
2      38150988
3      38150989
4      38151021
5      38151021
6      38151041
7      38151045
8      38151061
9      38151851
10     38153987
11     38154143
12     38154319
13    127733877
14    127733877
15    127734420
16    127734478
0      47960159
1      47960159
2      47960768
3      47960768
Name: Start, dtype: int32

it appears that 'Strand' is included as part of the sort no matter what. The only way I have found to properly sort these intervals is to move 'Strand' to a different column, and re-create the PyRanges object to reset the index:

>>> dat = pyranges.PyRanges(dat.assign('Strand_', lambda df: df.Strand).as_df().drop('Strand', axis=1)).sort()
>>> dat[dat.Chromosome == '8'].Start
0      38150980
1      38150984
2      38150988
3      38150989
4      38151021
5      38151021
6      38151041
7      38151045
8      38151061
9      38151851
10     38153987
11     38154143
12     38154319
13     47960159
14     47960159
15     47960768
16     47960768
17    127733877
18    127733877
19    127734420
20    127734478

endrebak commented 3 years ago

What is the reason for wanting to avoid sorting on the strand? I'd love to help you solve your problem.

You could do gr.df.sort_values(["Chromosome", "Start", "End"]). This will give you the DataFrame sorted on coordinates, but not the strand.

PyRanges are split into different dfs per chromosome/strand (for various reasons). This has the downside that interleaving different strands is not possible without workarounds. But if I knew what you wanted to achieve I'm sure I could help you do so :)

chartl-rancho commented 3 years ago

What is the reason for wanting to avoid sorting on the strand? I'd love to help you solve your problem.

I'm building a map of transcription factor binding sites (from ChIP-seq) for use in annotating variants (with e.g., VariantEffectPredictor); and borrowing the strand information from related enhancer RNA, where available. For the annotation to work (or just for compatibility with a large number of tools that read bed files) it needs to be completely sorted by genomic coordinate. In general, this can be done just prior to output, but it has been an issue when counting ("this is the 3rd enhancer upstream of gene X" -- these counts are out-of-order if the strands are different).

From an outside perspective, if a user is unfamiliar with how PyRanges breaks up the data, it can be disconcerting to call .sort() and get records that are genomically out-of-order. It would be nice to have a strandedness=None option in .sort() (just like in nearest) -- and the resulting dataframe would maintain the Strand column, but (for the purpose of sorting and determining the dataframes) the strand would be treated as ..

endrebak commented 3 years ago

I should really create a faq about this. I do not want to make backward incompatible changes to pyranges, but I might throw together pyranges2 someday.

But what you are describing could be done in many ways, I’m sure.

If you had example input and expected output I could show you many ways to do what you want. But I guess these are not immediately obvious.

For example, you could make a function that did the transformation/computation you wanted and apply it to the pyranges.

I’m on mobile now so I can’t validate the example in the repl but:

def computation(df):
    # do your computation on each chromosome df
    return df

gr.apply(strandedness=False)

chartl-rancho commented 3 years ago

I do not want to make backward incompatible changes to pyranges

Sure; but adding a stranded flag to .sort would not be backwards incompatible so long as you left the default as True. More generally you could add stranded=False to the constructor itself, with:

def _init( ..., force_unstranded=False):
   ...
   self.__dict__["dfs"] = create_df_dict(df, stranded and not force_unstranded)

I could fork this & generate a PR with both features, if you'd prefer.

endrebak commented 3 years ago

I do not have a clear overview of all the ramifications your suggestion might have. It might be that some code, not just mine, but anyone's, depends on the dfs dict having a certain shape, like code that expects dfs["chr1", "+"] to return the plus strand when invoked if there is a Strand column.

What I propose is to not split the df upon creation, but rather switch the .dfs attribute of the pyranges-object to a property which acts as if the data were split by splitting them upon its invocation. Then the data can be split when needed and you can sort the df like any object.

Possible downsides:

Many (most) operations would distort the sort order anyways
Slower (efficiency was the reason I split the df to begin with, but pandas might have become much faster at groupby (splitting) and concat (merging chromosome dfs) since I started. Virtually all operations require that you split the data so it will come at a cost for every operation.
Might break existing code.

But I see that not having stuff sorted by strand can be confusing. So I'd seriously consider changing this. I'll have more time to test a prototype Friday.

endrebak commented 3 years ago

The property-approach outlined above might be slow though. Then gr["chr1"] might become surprisingly slow (for something that looks like a dict lookup). And chains of commands would need to split and merge upon every command instead of when serializing and deserializing.

I remember thinking a lot about this before deciding. I came to the conclusion that either solution had significant downsides. Now I've traded the opportunity to sort the pyranges with interleaving strands for great efficiency. The speed advantage was mostly seen on large datasets though, not small/medium ones.

I am open to seeing a sketch of your suggestion, but I would need to think a lot before actually adding anything.