tskit-dev / tskit

Population-scale genomics
MIT License
153 stars 72 forks source link

keep_intervals() giving _tskit.LibraryError: Can't squash, flush, simplify or link ancestors... #2936

Closed TymekPieszko closed 2 months ago

TymekPieszko commented 5 months ago

Having converted the output of Relate to a tree sequence using relate_lib (https://github.com/leospeidel/relate_lib), I was trying to obtain a subsection of the full ts. I got the following error, suggesting a problem with edge metadata:

>>> ts = ts.keep_intervals([(0, 10_000)])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/scro4331/.local/lib/python3.9/site-packages/tskit/trees.py", line 6769, in keep_intervals
    tables.keep_intervals(intervals, simplify, record_provenance)
  File "/home/scro4331/.local/lib/python3.9/site-packages/tskit/tables.py", line 3854, in keep_intervals
    self.simplify(record_provenance=False)
  File "/home/scro4331/.local/lib/python3.9/site-packages/tskit/tables.py", line 3464, in simplify
    node_map = self._ll_tables.simplify(
_tskit.LibraryError: Can't squash, flush, simplify or link ancestors with edges that have non-empty metadata. (TSK_ERR_CANT_PROCESS_EDGES_WITH_METADATA)
jeromekelleher commented 5 months ago

That's interesting - I didn't realise people were using edge metadata. Out of curiosity, can you show us the edge table to see what's in there @TymekPieszko?

tables = ts.dump_tables()
print(tables.edges)

Is there an easy way to drop the metadata from a table @benjeffery?

TymekPieszko commented 5 months ago

Yes, looks as below:

╔═════════╤══════════════════╤══════════════════╤════════╤════════╤══════════════════════╗
║id       │left              │right             │parent  │child   │metadata              ║
╠═════════╪══════════════════╪══════════════════╪════════╪════════╪══════════════════════╣
║0        │184311562.50000000│184311895.00000000│36074489│     352│b'183302894 185513958'║
║1        │184311562.50000000│184311895.00000000│36074489│     353│b'183302894 185513958'║
║2        │ 73482179.50000000│ 73482487.50000000│18087521│     263│  b'71825215 74354148'║
║3        │ 73482179.50000000│ 73482487.50000000│18087521│     266│  b'71825215 74354148'║
...
jeromekelleher commented 5 months ago

I see, that's interesting - Relate is doing something useful with the metadata. Good for them!

Dropping the metadata is most easily done like this I think:

tables = ts.dump_tables()
edges = tables.edges.copy()
tables.edges.set_columns(left=edges.left, right=edges.right, parent=edges.parent, child=edges.child)
ts_no_edge_md = tables.tree_sequence()

Does this work for you?

benjeffery commented 5 months ago

Sorry, just seen this - yes, that is how I would drop the metadata too.

TymekPieszko commented 5 months ago

Yes, @jeromekelleher, this does prevent this particular error, thanks!

nspope commented 5 months ago

The edge metadata in question here are the "equivalent" edge in adjacent trees -- that is, Relate builds a bunch of marginal trees, and "links" edges across trees based on similarity of the subtending sample set. (More precisely what is stored are the IDs of the nodes subtended by the equivalent edges). Unless you're using that information, there's no need to keep it around in the tree sequence (and if you use the --compress flag in the conversion tool, the edges themselves will actually persist across multiple trees).

hyanwong commented 2 months ago

Just to say that the most recent tskit release has drop_metadata() functions on all tables (I think!)