samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

Limited API visibility issues around attributes #1353

Open heuermh opened 5 years ago

heuermh commented 5 years ago

Apologies if I'm overlooking something in the APIs; the code block starting here

https://github.com/samtools/htsjdk/blob/698a4c37a735ff43f8b568d798000cd32ea17260/src/main/java/htsjdk/samtools/SAMTextWriter.java#L159

calls both TextTagCodec.encodeUnsignedArray and TextTagCodec.encode. Why does encode not call encodeUnsignedArray if appropriate? The problem is encode is public and encodeUnsignedArray is not (package-private).

Also an issue is SAMRecord.getBinaryAttributes() being protected visibility. This forces all external libraries to go though attribute conversion.

Can SAMRecord.getBinaryAttributes() be made public, and either TextTagCodec.encodeUnsignedArray be made public or (better yet) move the call to encodeUnsignedArray inside encode?

cmnbroad commented 5 years ago

Not sure how the maintainers feel, but ideally it does seem that encode should just call encodeUnsignedArray, especially since encode seems to silently do the wrong thing with unsigned arrays. But that could cause BWC problems. Certainly making encodeUnsignedArray public seems reasonable. Exposing the binary attributes directly from SAMRecord seems more questionable, since it removes the only guard we have against someone incorrectly constructing unsigned array attributes. @heuermh Could you use List<SAMTagAndValue> getAttributes() instead ?

heuermh commented 5 years ago

Could you use List<SAMTagAndValue> getAttributes() instead ?

We currently do, but in the use case of conversion of SAMRecord attributes to another representation, it is awfully convoluted. SAMTagAndValue drops the tag type which needs to be reconstructed later from the JVM type of the tag value. That method also copies the entire linked list structure of SAMBinaryTagAndValue to another list, which can be more expensive than necessary if the caller is filtering attributes.

But that could cause BWC problems.

I'm sorry, what is BWC?

cmnbroad commented 5 years ago

Sorry, BWC=backward compatibility. We talked about this a bit today during code review. It looks like encode doesn't have enough info to do the right thing given it's current signature anyway, so making encodeUnsignedArray public seems like a good change. As for SAMRecord, we'd be reluctant to expose the binary attributes directly, but since the tag type isn't explicitly stored in those anyway, changing SAMTagAndValue to include the type might be a better approach.