tskit-dev / tskit

Population-scale genomics
MIT License
155 stars 73 forks source link

a way to read .trees metadata without loading the whole file would be useful #1854

Closed bhaller closed 2 years ago

bhaller commented 3 years ago

This came up in SLiM (https://github.com/MesserLab/SLiM/discussions/233) but I think it has general utility, as I'll describe. What the user wants is a way to get the metadata from a .trees file without the big overhead of loading the whole .trees into memory (whether as tables or a tree sequence); just the metadata.

In SLiM they want this so they can read in parameters that they previously put into the metadata, before the point in their script where they would actually load in the .trees data.

This seems generally useful to me because someone might wish to, e.g., loop in Python over the .trees files in a directory that contains many of them, and do "something" with each .trees file that has metadata with a certain property. Like: process all the .trees files that come from msprime but not those from SLiM, or copy all those where their parameter XYZZY had a value of 15.5 into a different folder, or whatever.

So for SLiM it'd be great to have C API for this; for other uses Python API seems called for. Would be nice to have it in C API 1.0 – we would use it immediately in SLiM – but it's not a big deal if it isn't.

grahamgower commented 3 years ago

This seems generally useful to me because someone might wish to, e.g., loop in Python over the .trees files in a directory that contains many of them, and do "something" with each .trees file that has metadata with a certain property.

I have done exactly this, but to get the provenance info (I bodged parameters in here before tskit had metadata). The kastore code is fairly minimal. However I agree that it makes sense to have a small kastore wrapper in tskit that that is specifically for metadata.

jeromekelleher commented 3 years ago

I can definitely see the value of adding this to the Python API, and it would be simple enough to implement. Pulling the metadata out of the kastore and decoding it is definitely handy, and not trivial to implement. The only downside is a run-time dependency on the kastore Python package, but that's no big deal. I guess the simplest thing about be to call the function load_metadata or something? (Or maybe peek_metadata?)

I'm less enthusiastic about adding it to the C API, mainly because it really only is a few lines of kastore code to do (and anyone using the tskit C API already has kastore). Since we're not decoding the metadata in C, tskit would just give you back a bunch of bytes. Then, would you want to return just the metadata, or would you want the schema also? The only argument for keeping it in tskit I would imagine is that so client code wouldn't need to know about the internal file format, but it's pretty simple and I would be very surprised if the way of accessing the top level metadata ever changed. We probably haven't made any formal contracts about the file format, but the top level details won't change now.

I could be convinced otherwise though, of course, if anyone wants to suggest a C API.

bhaller commented 3 years ago

I'm less enthusiastic about adding it to the C API, mainly because it really only is a few lines of kastore code to do (and anyone using the tskit C API already has kastore). Since we're not decoding the metadata in C, tskit would just give you back a bunch of bytes. Then, would you want to return just the metadata, or would you want the schema also?

Schema also, I suppose?

The only argument for keeping it in tskit I would imagine is that so client code wouldn't need to know about the internal file format, but it's pretty simple and I would be very surprised if the way of accessing the top level metadata ever changed. We probably haven't made any formal contracts about the file format, but the top level details won't change now.

I could be convinced otherwise though, of course, if anyone wants to suggest a C API.

Well, we can certainly do that "few lines of kastore code" ourselves in SLiM. Seems like a cleaner/better design for those lines to live inside tskit, and I personally would have no idea what those few lines of kastore code would actually look like so there's that :->, but with a helping hand I'm sure we can get it done. As for suggesting a C API, perhaps that would be best done by @petrelharp? I have no idea. :->

jeromekelleher commented 3 years ago

Thinking about this a bit more, I think the right approach is to add a new flag TSK_LOAD_SKIP_TABLES (or something) which is provided to table_collection_load. With this flag, only the top-level data for a tree sequence will be loaded in and set on the table collection instance. The tables will be ignored, and therefore have zero rows in the returned table collection/tree sequence.

This is the simplest approach because:

a) We'll nearly always want to look at things like the file format version anyway if peeking at the metadata b) We don't need to come up with any new APIs or ways of packaging up just the metadata/schema. It'll just be there on an empty table collection/tree sequence. c) It'll be easy to implement - just put a conditional around the x_table_load functions here

This will also work in Python, as we can add a boolean flag to the load function and it'll take care of itself.

We'd need some help to get this done if you'd like it before 1.0.0 @bhaller/@grahamgower - we're already snowed under with stuff to get wrapped up.

bhaller commented 3 years ago

I'm happy to take a crack at it, but will probably flail. :-> It looks like tsk_table_collection_load() has an options already, so I can define a new flag for that, and I guess add a bool flag to tsk_table_collection_loadf_inited() that uses that passed in; that's not public API I guess, so I can add a bbol to its interface I suppose? Then obvs easy to if out the x_table_load functions in tsk_table_collection_loadf_inited() based on that flag.

I have absolutely no idea how to do the Python-side work at all, though – you say "add a boolean flag to the load function" but mysteries abound. :-> Wouldn't know how to test the python side (or the C side, for that matter) properly even if I did get it working. Also not sure what the protocol is for all of this – fork tskit, make a local branch in my fork, commit something to that local branch, push the commit to my GitHub fork, turn it into a PR in GitHub? And then do I need to do something to make it a PR for the "real" tskit, not just for my fork of tskit? You probably don't realize how clueless I am in Git, I'm very accustomed to working in my own fiefdom and I never do anything complicated. :-> Best might be if I could watch someone else like @grahamgower or @petrelharp fix this over zoom; I'd certainly be interested in learning how to do this kind of minor fix!

jeromekelleher commented 3 years ago

Thanks @bhaller! We could leave out the Python side of things on the initial pass, it's an easy one to fill in later.

Re the workflow, the dev docs should answer your questions (if they don't, it's a bug). We've just done a format switch, so if you see any issues, please do let us know.

bhaller commented 3 years ago

Wow, you guys are organized – dev docs! OK @jeromekelleher, I will take a crack at it in the next couple days, unless @clwgg (who started this whole thing! :->) wants to have a go. :->

clwgg commented 3 years ago

I'm happy to give it a try!

jeromekelleher commented 3 years ago

Great! Feel free to ping questions here if you hit any issues @clwgg!

benjeffery commented 3 years ago

@clwgg @bhaller this is a great opportunity to debug the dev docs, so please don't hold back on corrections or suggestions.

clwgg commented 3 years ago

@benjeffery thanks, the dev docs are pretty fantastic! They made it much much easier to get into the project from the outside, especially when touching the backend for the first time. I just opened a PR that fixes two broken links I found (https://github.com/tskit-dev/tskit/pull/1880) -- otherwise I don't think they missed anything I felt I needed.

clwgg commented 3 years ago

alright, I just opened a (very-much-a-)draft PR for this feature. looking forward to your feedback (:

bhaller commented 2 years ago

Hi @clwgg, @benjeffery. Just wondering whether this is likely to make it into C API 1.0. @clwgg, do you care? (If it makes it into C API 1.0 then it will make it into SLiM 3.7; if not, probably not.)

clwgg commented 2 years ago

Hi! So the C side is basically ready barring some additions to the documentation -- on the Python side (which is less relevant here but part of the PR) I still have some tests to add but the PR is essentially "feature complete". I hope to finish up the tests and documentation this coming weekend. That's just my two cents -- I can't speak to the merge window of the API version bump (:

jeromekelleher commented 2 years ago

It's basically ready, we'll get it in for 0.4.0/1.0

bhaller commented 2 years ago

Hi all. Looking into using this now. I'm surprised to see that when TSK_LOAD_SKIP_TABLES is set, it loads the reference sequence. The goal here, as I see it, is to get the metadata; I don't really see what use the reference sequence would be in this context (what's a scenario where someone wants the reference sequence without actually wanting to load the tables too?). For a 1e10 length chromosome, the reference is ~10 GB, so we really don't want to go loading it in when we don't need it; this variant is supposed to be lightweight and fast, I think. I'm going to reopen this issue (or should I open a new issue?); of course if you really think this is correct functionality then go ahead and close it again :-> But my vote is for either (a) not loading the reference sequence with TSK_LOAD_SKIP_TABLES, or (b) adding another flag that lets me specify that desire as well. Thanks!

clwgg commented 2 years ago

Hi! I agree that there will be more use-cases of TSK_LOAD_SKIP_TABLES where we wouldn't want to load the reference than those where we do (if there are any). Still, I suppose the reference is more "top-level" than "table" information, so in the way the flag is currently documented it also doesn't feel totally right to just skip the reference as well? So maybe a TSK_LOAD_SKIP_REF flag (that could even be implied by TSK_LOAD_SKIP_TABLES and documented as such) would be more clear? FWIW, the current reference code got merged right before the skip_tables code, so this might have also just been a rebase mishap on my part in that I didn't catch the newly added tsk_table_collection_load_reference_sequence in the TSK_LOAD_SKIP_TABLES conditional -- I'm of course happy to fix that if that's indeed the desired solution.

bhaller commented 2 years ago

Agreed, if loading the reference sequence is simply taken out then the doc should change, since the ref seq is "top level" not "tables".

benjeffery commented 2 years ago

when TSK_LOAD_SKIP_TABLES is set, it loads the reference sequence

The ref seq behaviour is still under development, so we hadn't yet considered it's interaction with this feature. Thanks for the reminder though and sorry you are having to be on the bleeding edge!

jeromekelleher commented 2 years ago

You've preempted a few issues that have cropped up while I've been working through the details in #1944 and didn't have time to lay out properly @bhaller. I'm going to close this issue and open some more so we can discuss in isolation.

bhaller commented 2 years ago

Thanks @benjeffery and @jeromekelleher. No worries, I chose to walk out onto the bleeding edge here. :->