samtools / htslib

C library for high-throughput sequencing data formats
Other
785 stars 447 forks source link

Add new annot-tsv program #1619

Closed pd3 closed 9 months ago

pd3 commented 1 year ago

See also https://raw.githubusercontent.com/pd3/utils/master/annot-regs/doc.annot-regs.pdf https://github.com/pd3/utils/tree/master/annot-regs

pd3 commented 1 year ago

Mmm, I am confused why the tests complain about strdup being implicitly declared, even after #include <string.h> has been added, while tabix code gets away with it...?

jkbonfield commented 1 year ago

In subsequent emails I think I decided on better terminology of "input" and "annotation". "Source" still feels nebulous and not very descriptive of the function.

The usage has:

   -s, --source-file FILE          Source file to take annotations from
   -t, --target-file FILE          Target file to be extend with annotations from -s

From this, it sounds like "target" is our file we are extending, with annotations from "source".

The man page says:

       -f, --transfer SRC:TGT
           Comma-separated list of columns to transfer. If the SRC column does not exist,
           interpret it as the default value to fill in when a match is found or a dot (".") when
           a match is not found. If the TGT column does not exist, a new column is created. If
           the TGT column already exists, its values will be overwritten when overlap is found
           and left as is otherwise.

This is a bit ambiguous as it sort of implies it can work in either direction.

I'm still struggling to work out if it's completely symmetric and we basically have file1 file2 that are merged (akin to unix join), or whether we have a primary input file and an annotation file that is used to modify the input. If so, which is actually the primary input? I think this all hinges on the "If the SRC column does not exist" vs "If the TGT column does not exist" interpretation. They're not equal, so clearly one is more important than the other and one of the two files is the primary file and the other is the secondary file, but I find the flow of data hard to grok.

Think of it in this way. Let's say we have an awk script in a very strict layout:

$1 ~ /foo/ {$2="abc"}
$1 ~ /bar/ {$2="pqr"}
$1 ~ /ram/ {$2="xyz"}
{print}

We can feed any data to it and the script will amend specific matching columns based on the contents of other columns. It's clear here that a null script degenerates to "cat". It's clear the input is the input, and the script is the script. We never really do things like pipe a script into awk and specify a filename as input on the command line.

Here it's a little different as we have to have at least one matching column between two files (hence why I say it's like a more powerful "join"). That does change things. I don't understand the ordering though. What's the use case for it working on unordered data? What's the benefit here? It just feels like it'd make it far slower than necessary. Or the half way house: is there a benefit to one order (primary) and one unordered (annotation) file?

I'm pleased however to see the addition of "-o" for output since I last looked at this, and renaming the old overlap option.

pd3 commented 1 year ago

In subsequent emails I think I decided on better terminology of "input" and "annotation". "Source" still feels nebulous and not very descriptive of the function.

Unless there is a danger of confusion, let's preserve the existing naming please. It is to avoid conflicting short names, I already had to make sacrifices by abandoning the term "destination" which you disliked.

The naming was intended to imply the information flow and for that reason I liked best the original "source" and "destination". Using that naming would also prevent the next question:

This is a bit ambiguous as it sort of implies it can work in either direction.

I'm still struggling to work out if it's completely symmetric and we basically have file1 file2 that are merged (akin to unix join), or whether we have a primary input file and an annotation file that is used to modify the input.

It is not symmetrical and is not intended to imply a change in direction. It aims to explain that the argument to --transfer are column names, in the source and the target file, and what happens if such column does or does not exist in the source and the target file: if the column exists in both, values in the target file are overwritten by values from the source file on match or left as is when there is no match; if such column does not exist in the target file, a new column with that name will be created in the target file; if such column does not exist in the source file, the name is interpreted as a string to be filled in the target file. For example, if the column MATCH does not exist in either file, --transfer MATCH:MATCH (which can be abbreviated to --transfer MATCH) will create a new column MATCH in the target file and put the value MATCH whenever there is a match or a dot when there isn't.

@daviesrob, @whitwham could you please help to improve the wording of the man page and the usage text to make it clearer? Despite my best effort I am not succeeding in making this understandable.

I don't understand the ordering though. What's the use case for it working on unordered data? What's the benefit here? It just feels like it'd make it far slower than necessary. Or the half way house: is there a benefit to one order (primary) and one unordered (annotation) file?

I am confused by the question. There are two types of ordering you might be asking, by column and by row. You seem to be asking about column ordering, which is the only type of ordering mentioned throughout the man page and the code.

To state the obvious, the benefit of supporting tab-delimited files with columns in arbitrary order is that one can use the program on files with columns in arbitrary order. For example, on files that were exported from spreadsheets, people tend to use them a lot. Regarding ordering by row, even though that is not mentioned anywhere, the program can work with data unordered by row as well. That's thanks to the regidx library.

As a side note, comparing the program to awk, join etc is misleading. It is also a bit like grep and bedtools, but mostly provides a unique functionality that is not found elsewhere.

jkbonfield commented 1 year ago

I meant row ordering, so classic sorted vs unsorted (or non-position sorted) genomic data.

We have lots of algorithms that are efficient because they require sorted data. Eg samtools merge. Here you seem to be stating that both files can be unsorted. This obviously means it is using a less efficient strategy, which is fine if that is the user requirement. It also depends on the size of files you're designing it to operate on. Is this for small data of a few thousand rows, or are you planning for it to lift over annotations from files with 10s of millions of records?

I'm just asking where that requirement for operating on unsorted data comes from. What inputs are we dealing with that aren't sorted? (Even unix "join" requires sorted inputs)

jkbonfield commented 1 year ago

In subsequent emails I think I decided on better terminology of "input" and "annotation". "Source" still feels nebulous and not very descriptive of the function.

Unless there is a danger of confusion, let's preserve the existing naming please. It is to avoid conflicting short names, I already had to make sacrifices by abandoning the term "destination" which you disliked.

Maybe so, but I asked others in the NPG group where they thought the input came from and where the output went, and universally they said source and destination! Let's just agree to disagree on this one. I'm looking for better terms, not to rehash old arguments.

I still think input and output are by far the most universally understood terms. As you've agreed it's not symmetric and one file is the primary input, then why not call that file input, and the other something more appropriate to its function?

jkbonfield commented 1 year ago

With a trivial src and dst (transfer) file:

$ cat /tmp/src.1.txt
#chr    beg end tag3
1   14  15  a
1   35  35  b

$ cat /tmp/dst.1.txt
#chr    beg end tag1    tag2
1   14  15  x1  x2
1   35  35  y1  y2

I can do a nop function of specifying nothing to transfer over:

$ ./annot-tsv -s /tmp/src.1.txt -t /tmp/dst.1.txt  -c chr,beg,end
#[1]chr [2]beg  [3]end  [4]tag1 [5]tag2
1   14  15  x1  x2
1   35  35  y1  y2

I notice a few things. Firstly, the header line has changed format. Is that deliberate or just debugging left over? It's undocumented.

Also what you label as the dst file in your tests (old destination) is the default output, and not the source as I had expected! If we use -f to transfer tags, then these come from source and not the target file. This is still confusingly named therefore. PLEASE consider just renaming the input to the "input" as it's then crystal clear what it is.

I also tried stress testing. Removing one of the columns from one src row (I deleted the "b" at the end of the 2nd line) to get illegal data caused a core dump:

./annot-tsv -s /tmp/src.1.txt -t /tmp/dst.1.txt -f tag3
AddressSanitizer:DEADLYSIGNAL
=================================================================
==8978==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x0000004d3015 bp 0x7fff4ad0c9d0 sp 0x7fff4ad0c760 T0)
==8978==The signal is caused by a READ memory access.
==8978==Hint: this fault was caused by a dereference of a high value address (see register values below).  Disassemble the provided pc to learn which register was used.
    #0 0x4d3015 in __ac_X31_hash_string /nfs/users/nfs_j/jkb/work/samtools_master/htslib/./htslib/khash.h:401
    #1 0x4d3015 in kh_get_str2int /nfs/users/nfs_j/jkb/work/samtools_master/htslib/./htslib/khash_str2int.h:30
    #2 0x4d3015 in khash_str2int_has_key /nfs/users/nfs_j/jkb/work/samtools_master/htslib/./htslib/khash_str2int.h:69
    #3 0x4d3015 in process_line /nfs/users/nfs_j/jkb/work/samtools_master/htslib/annot-tsv.c:685
    #4 0x4d596a in main /nfs/users/nfs_j/jkb/work/samtools_master/htslib/annot-tsv.c:875
    #5 0x7f8fdbee3c86 in __libc_start_main /build/glibc-CVJwZb/glibc-2.27/csu/../csu/libc-start.c:310
    #6 0x41d939 in _start ??:?

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /nfs/users/nfs_j/jkb/work/samtools_master/htslib/./htslib/khash.h:401 in __ac_X31_hash_string
==8978==ABORTING
Aborted

We need to stress test more corner cases so it's robust.

jkbonfield commented 1 year ago

What specifies the ordering?

My source file has columns "tag3" and "m" in that order, but I always get them in the opposite order regardless of what I specify in the -f option.

@ seq4d[samtools.../htslib]148; cat /tmp/src.1.txt
#chr    beg end tag3    m
1   14  15  a   1
1   35  35  b   01
@ seq4d[samtools.../htslib]; ./annot-tsv -s /tmp/src.1.txt -t /tmp/dst.1.txt -f m,tag3
#[1]chr [2]beg  [3]end  [4]tag1 [5]tag2 [6]m    [7]tag3
1   14  15  x1  x2  1   a
1   35  35  y1  y2  01  b
@ seq4d[samtools.../htslib]; ./annot-tsv -s /tmp/src.1.txt -t /tmp/dst.1.txt -f tag3,m
#[1]chr [2]beg  [3]end  [4]tag1 [5]tag2 [6]m    [7]tag3
1   14  15  x1  x2  1   a
1   35  35  y1  y2  01  b

Also, please consider making the usage appear to stdout unless an error has occurred. This is already a cause of a lot of "known failures" in the Samtools test harness, and we shouldn't be adding more here. Specifically an error is an error (eg anno-tsv --unknown-opt), but a request for usage (anno-tsv -h) is not. Same with exit codes. -h should have exit code 255. (Nor should errors really - you probably wanted exit(1) instead of exit(-1) there)

pd3 commented 1 year ago

I can do a nop function of specifying nothing to transfer over:

In this mode it functions as grep would.

pd3 commented 1 year ago

I notice a few things. Firstly, the header line has changed format. Is that deliberate or just debugging left over? It's undocumented.

That's a deliberate feature. If it is bothersome to anyone, we can add an option to not modify the header. I would leave this for future improvements though.

pd3 commented 1 year ago

Also what you label as the dst file in your tests (old destination) is the default output, and not the source as I had expected! If we use -f to transfer tags, then these come from source and not the target file. This is still confusingly named therefore. PLEASE consider just renaming the input to the "input" as it's then crystal clear what it is.

I really don't understand the problem. The documentation is clear on that, no? The program transfers columns from the file with source annotations (-s) to the destination file, newly "target" file (-t).

  -s, --source-file FILE          Source file to take annotations from
  -t, --target-file FILE          Target file to be extend with annotations from -s

To turn this around: both files are technically inputs and typically both have annotations. I can be equally confused here.

pd3 commented 1 year ago
./annot-tsv -s /tmp/src.1.txt -t /tmp/dst.1.txt -f tag3
AddressSanitizer:DEADLYSIGNAL
=================================================================
==8978==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x0000004d3015 bp 0x7fff4ad0c9d0 sp 0x7fff4ad0c760 T0)
==8978==The signal is caused by a READ memory access.
==8978==Hint: this fault was caused by a dereference of a high value address (see register values below).  Disassemble the provided pc to learn which register was used.
    #0 0x4d3015 in __ac_X31_hash_string /nfs/users/nfs_j/jkb/work/samtools_master/htslib/./htslib/khash.h:401
    #1 0x4d3015 in kh_get_str2int /nfs/users/nfs_j/jkb/work/samtools_master/htslib/./htslib/khash_str2int.h:30
    #2 0x4d3015 in khash_str2int_has_key /nfs/users/nfs_j/jkb/work/samtools_master/htslib/./htslib/khash_str2int.h:69
    #3 0x4d3015 in process_line /nfs/users/nfs_j/jkb/work/samtools_master/htslib/annot-tsv.c:685
    #4 0x4d596a in main /nfs/users/nfs_j/jkb/work/samtools_master/htslib/annot-tsv.c:875
    #5 0x7f8fdbee3c86 in __libc_start_main /build/glibc-CVJwZb/glibc-2.27/csu/../csu/libc-start.c:310
    #6 0x41d939 in _start ??:?

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /nfs/users/nfs_j/jkb/work/samtools_master/htslib/./htslib/khash.h:401 in __ac_X31_hash_string
==8978==ABORTING
Aborted

Thank you, added a commit to address issues like this https://github.com/samtools/htslib/pull/1619/commits/ee299e355665ee5443449d5cc26ea714f19ea90e. More breaking test cases will be much appreciated

pd3 commented 1 year ago

What specifies the ordering?

My source file has columns "tag3" and "m" in that order, but I always get them in the opposite order regardless of what I specify in the -f option.

New columns are only appended, never added among existing ones. I suspect the dst.1.txt file in this example contains the column m already? Otherwise in my tests the ordering of new columns can be influenced by giving them in different order in the -f option.

Reordering of existing columns in the target/destination file is not considered in this version, perhaps it could be addressed in a future enhancement request.

pd3 commented 1 year ago

Also, please consider making the usage appear to stdout unless an error has occurred. This is already a cause of a lot of "known failures" in the Samtools test harness, and we shouldn't be adding more here. Specifically an error is an error (eg anno-tsv --unknown-opt), but a request for usage (anno-tsv -h) is not. Same with exit codes. -h should have exit code 255. (Nor should errors really - you probably wanted exit(1) instead of exit(-1) there)

The exit code for -h was now changed to EXIT_SUCCESS and EXIT_FAILURE otherwise, https://github.com/samtools/htslib/pull/1619/commits/1424ca621d1ce25504849aed2ec90624c0fba5cf

daviesrob commented 9 months ago

Rebased, with some squashing of a couple of trivial bug-fix commits, and some adjustments from me...