Inconsistencies between `--sample-name-list` and `--filter-samples-exclude`

capoony commented 1 year ago

Hi Lucas,

thanks for this new implementation. This will be very useful!

However, I am running into a problem when I want to exclude samples from a sync file. I provide a list of 737 names with --sample-name-list and then a list of 13 samples to be excluded with --filter-samples-exclude.

Now I get the following error, which indicates that the sample name list is not updated correctly when samples are excluded. Is there a quick fix for that?

--sample-name-list(names.txt): Invalid sample names list that contains 737 name entries. This is incongruent with the input file, which contains 724 samples (after filtering, if a sample name filter was given).

Thanks, Martin

lczech commented 1 year ago

Hi Martin,

thanks for the feedback, I'll look into it! I'm currently refactoring the code on a larger scale, in order to get ready for publication. Once I'm done with that (next week-ish, I hope), I'll see if the error persists :-)

Happy to hear that you find the tool useful - stay tuned for more features in the near future!

Cheers Lucas

capoony commented 1 year ago

Hi Lucas,

thanks a lot!!

lczech commented 1 year ago

Hi @capoony,

finally circling back to this issue. I've now re-worked most of the code related to sample naming, as I feel it was too messy and error-prone before. In particular, relying on the order of columns in sync files was not easy to work with, for example when specifying the --pool-sizes, and did not easily allow working with multiple input files either.

So now (currently only on the dev branch), I've instead implemented the following approach:

- Samples from file formats that have sample names (such as VCF columns) use those names.
- For file formats such as sync, the sample names are instead assigned based on the file name, so that /path/to/my_file.sync gives samples my_file.1, my_file.2, etc. (unless a header is provided in the sync file, in which case those sample names are used).
- Then, these samples can all be renamed and filtered as needed (with redesigned options, which should solve your problem).
- Similarly, providing pool sizes (for fst, for instance) also uses these sample names.
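
To make the naming scheme above concrete, here is a minimal Python sketch of how such default names could be derived. It is only an illustration of the rule as described in this comment, not the actual grenedalf code; the function name `default_sample_names` and its parameters are hypothetical.

```python
from pathlib import Path


def default_sample_names(sync_path, n_samples, header_names=None):
    """Hypothetical helper mirroring the naming rule described above.

    If the sync file provides a header, those names are used as-is.
    Otherwise, names are built from the file name (without directory
    and extension) plus a 1-based sample index.
    """
    if header_names is not None:
        if len(header_names) != n_samples:
            raise ValueError("header does not match the number of samples")
        return list(header_names)
    stem = Path(sync_path).stem  # "/path/to/my_file.sync" -> "my_file"
    return [f"{stem}.{i}" for i in range(1, n_samples + 1)]


print(default_sample_names("/path/to/my_file.sync", 3))
# ['my_file.1', 'my_file.2', 'my_file.3']
```

Because every sample gets a name that does not depend on its column position, later renaming, filtering, and pool size assignment can all refer to the same names.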

For you, that would require using the new --rename-samples-file option instead of --sample-name-list, using sample names (either from the file, or following the naming scheme above) for the renaming, instead of relying on the sample order in the sync file. That should make it more robust and less error-prone, and allows using all options even when multiple input files are provided. The rest should then work the same. If you want, check out the dev branch; this will also be part of the next release (v0.3.0).
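
As a rough sketch of the migration: an existing order-based name list (one name per sync column) can be turned into an explicit old-name/new-name table. Note that the layout used here, one pair per line separated by a tab, is an assumption made for illustration; please check the documentation of --rename-samples-file for the format it actually expects. The function `rename_table` is hypothetical as well.

```python
from pathlib import Path


def rename_table(sync_path, new_names, sep="\t"):
    """Build rename lines from an order-based list of new names.

    Assumption (not confirmed in this thread): one "old<sep>new" pair
    per line. The old names follow the file-name-based scheme,
    e.g. my_file.1, my_file.2, ...
    """
    stem = Path(sync_path).stem
    lines = [
        f"{stem}.{i}{sep}{name}"
        for i, name in enumerate(new_names, start=1)
    ]
    return "\n".join(lines) + "\n"


# Example with placeholder names; in practice, read them from names.txt.
new_names = ["pop_A", "pop_B", "pop_C"]
print(rename_table("/path/to/my_file.sync", new_names), end="")
```

With explicit pairs like these, excluding samples no longer shifts any positions, so the rename list and the sample filter cannot get out of step the way the 737 names and 724 remaining samples did.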

Hence, closing this issue now, but should you have a better idea of how to solve this, or encounter any more trouble, feel free to re-open :-)

Cheers Lucas

lczech / grenedalf
