Open chartl-rancho opened 3 years ago
What is the reason for wanting to avoid sorting on the strand? I'd love to help you solve your problem.
You could do `gr.df.sort_values(["Chromosome", "Start", "End"])`. This will give you the DataFrame sorted on coordinates, but not the strand.
PyRanges are split into different dfs per chromosome/strand (for various reasons). This has the downside that interleaving different strands is not possible without workarounds. But if I knew what you wanted to achieve I'm sure I could help you do so :)
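As a minimal sketch of the `sort_values` suggestion, the snippet below uses a small hand-made DataFrame as a stand-in for `gr.df` (the intervals are invented for illustration). Sorting on the coordinate columns alone interleaves the strands:

```python
import pandas as pd

# Stand-in for gr.df; these intervals are made up for illustration.
df = pd.DataFrame({
    "Chromosome": ["chr1", "chr1", "chr1", "chr2"],
    "Start": [500, 100, 300, 50],
    "End": [600, 200, 400, 80],
    "Strand": ["+", "-", "+", "-"],
})

# Sort on coordinates only; Strand is carried along but is not a sort key.
by_coordinate = df.sort_values(["Chromosome", "Start", "End"]).reset_index(drop=True)
print(by_coordinate)
```

Note that the chromosome names are compared as plain strings here, so `chr10` would sort before `chr2`.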
> What is the reason for wanting to avoid sorting on the strand? I'd love to help you solve your problem.
I'm building a map of transcription factor binding sites (from ChIP-seq) for use in annotating variants (e.g., with VariantEffectPredictor), borrowing the strand information from related enhancer RNA where available. For the annotation to work (or just for compatibility with the large number of tools that read bed files), the map needs to be completely sorted by genomic coordinate. In general, this can be done just prior to output, but it has been an issue when counting: for "this is the 3rd enhancer upstream of gene X", the counts are out of order if the strands are different.
From an outside perspective, if a user is unfamiliar with how PyRanges breaks up the data, it can be disconcerting to call `.sort()` and get records that are genomically out of order. It would be nice to have a `strandedness=None` option in `.sort()` (just like in `nearest`): the resulting dataframe would maintain the `Strand` column, but (for the purpose of sorting and determining the dataframes) the strand would be treated as `.`.
I should really create an FAQ entry about this. I do not want to make backward incompatible changes to pyranges, but I might throw together pyranges2 someday.
But what you are describing could be done in many ways, I’m sure.
If you had example input and expected output I could show you many ways to do what you want. But I guess these are not immediately obvious.
For example, you could make a function that did the transformation/computation you wanted and apply it to the pyranges.
I’m on mobile now so I can’t validate the example in the REPL, but:
def computation(df):
    # do your computation on each chromosome df
    return df

# strand=False: one df per chromosome rather than one per chromosome/strand pair
gr.apply(computation, strand=False)
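To make the per-chromosome idea concrete, here is a hedged sketch in plain pandas, so that it runs without pyranges. `groupby("Chromosome")` stands in for the per-chromosome frames, and the function name `number_by_position`, the `Rank` column, and the data are all assumptions made for illustration. It numbers the intervals on each chromosome in coordinate order, regardless of strand:

```python
import pandas as pd

def number_by_position(df):
    # Rank the intervals of one chromosome by coordinate, ignoring Strand.
    df = df.sort_values(["Start", "End"]).copy()
    df["Rank"] = range(1, len(df) + 1)
    return df

# Made-up intervals for illustration.
df = pd.DataFrame({
    "Chromosome": ["chr1", "chr1", "chr1", "chr2"],
    "Start": [500, 100, 300, 50],
    "End": [600, 200, 400, 80],
    "Strand": ["+", "-", "+", "-"],
})

# One group per chromosome, with both strands in the same frame.
parts = [number_by_position(group) for _, group in df.groupby("Chromosome")]
ranked = pd.concat(parts, ignore_index=True)
print(ranked)
```

A function of this shape, taking one DataFrame and returning one, is the kind of thing that could be handed to `gr.apply`.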
> I do not want to make backward incompatible changes to pyranges
Sure, but adding a `stranded` flag to `.sort` would not be backwards incompatible so long as you left the default as `True`. More generally, you could add `stranded=False` to the constructor itself, with:
def __init__(..., force_unstranded=False):
    ...
    self.__dict__["dfs"] = create_df_dict(df, stranded and not force_unstranded)
I could fork this & generate a PR with both features, if you'd prefer.
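To illustrate why a flag with a default of `True` would leave existing behaviour alone, here is a small sketch in plain pandas. `sort_intervals` is a hypothetical helper, not part of pyranges. Its stranded path only approximates per-strand grouping by using `Strand` as a sort key, and the data are invented:

```python
import pandas as pd

def sort_intervals(df, stranded=True):
    # Hypothetical helper, not a pyranges function.
    # stranded=True groups by strand within each chromosome (the default);
    # stranded=False sorts on coordinates only, interleaving the strands.
    if stranded:
        keys = ["Chromosome", "Strand", "Start", "End"]
    else:
        keys = ["Chromosome", "Start", "End"]
    return df.sort_values(keys).reset_index(drop=True)

# Made-up intervals for illustration.
df = pd.DataFrame({
    "Chromosome": ["chr1", "chr1", "chr1", "chr2"],
    "Start": [500, 100, 300, 50],
    "End": [600, 200, 400, 80],
    "Strand": ["+", "-", "+", "-"],
})

stranded_sorted = sort_intervals(df)               # callers that pass nothing see no change
interleaved = sort_intervals(df, stranded=False)   # opt-in coordinate order
```

Only callers that explicitly pass `stranded=False` get the interleaved order.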
I do not have a clear overview of all the ramifications your suggestion might have. It might be that some code, not just mine, but anyone's, depends on the dfs dict having a certain shape, like code that expects `dfs["chr1", "+"]` to return the plus strand when invoked if there is a Strand column.
What I propose is to not split the df upon creation, but rather switch the `.dfs` attribute of the pyranges object to a `property` which acts as if the data were split, by splitting them upon its invocation. Then the data can be split when needed and you can sort the df like any object.
Possible downsides:
But I see that not having stuff sorted by strand can be confusing. So I'd seriously consider changing this. I'll have more time to test a prototype Friday.
The property approach outlined above might be slow though. Then `gr["chr1"]` might become surprisingly slow (for something that looks like a dict lookup). And chains of commands would need to split and merge upon every command instead of when serializing and deserializing.
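The property approach and its cost can be sketched roughly as follows. `LazyRanges` is a hypothetical class written for illustration, not pyranges code. It keeps one unsplit DataFrame and rebuilds the chromosome/strand dict on every access to `.dfs`:

```python
import pandas as pd

class LazyRanges:
    """Hypothetical stand-in for a PyRanges-like object; not the pyranges API."""

    def __init__(self, df):
        # A single unsplit frame, which can be sorted in any order.
        self.df = df

    @property
    def dfs(self):
        # The split happens on every access, which is the recurring cost.
        if "Strand" in self.df.columns:
            grouped = self.df.groupby(["Chromosome", "Strand"])
        else:
            grouped = self.df.groupby("Chromosome")
        return {key: part for key, part in grouped}

# Made-up intervals for illustration.
df = pd.DataFrame({
    "Chromosome": ["chr1", "chr1", "chr1", "chr2"],
    "Start": [500, 100, 300, 50],
    "End": [600, 200, 400, 80],
    "Strand": ["+", "-", "+", "-"],
})

gr = LazyRanges(df)
plus_chr1 = gr.dfs["chr1", "+"]  # looks like a dict lookup, but regroups the whole frame
```

Code that expects `dfs["chr1", "+"]` keeps working under this sketch, but each lookup pays for a full `groupby`.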
I remember thinking a lot about this before deciding. I came to the conclusion that either solution had significant downsides. Now I've traded the opportunity to sort the pyranges with interleaving strands for great efficiency. The speed advantage was mostly seen on large datasets though, not small/medium ones.
I am open to seeing a sketch of your suggestion, but I would need to think a lot before actually adding anything.
The attached file does not sort properly by chromosome, start and stop:
tf_locus.bed.gz
It appears that 'Strand' is included as part of the sort no matter what. The only way I have found to properly sort these intervals is to move 'Strand' to a different column, and re-create the PyRanges object to reset the index: