samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

Allow multiple VCFSimpleHeaderLine of same type #1531

Closed mjhipp closed 8 months ago

mjhipp commented 3 years ago

Description

Closes #277 Closes #500

Currently, it is not possible to add multiple VCFSimpleHeaderLines to a VCFHeader if they are of the same type (see issues linked above). A specific example, is adding ALT lines, which are included in 4.2 spec (that shows a header with multiple ALT lines on page 11).

To fix this, I added a new map and methods for working with SimpleVCFHeaderLines that are not of the default types.

I have seen another fix to this (#835), but it was closed due to large scope. I am hoping this change is small and useful enough to be considered. Thanks!

Things to think about before submitting:

codecov-io commented 3 years ago

Codecov Report

Merging #1531 (569ccf1) into master (2e3da45) will increase coverage by 0.000%. The diff coverage is 57.895%.

@@             Coverage Diff              @@
##              master     #1531    +/-   ##
============================================
  Coverage     69.407%   69.407%            
- Complexity      8924      8945    +21     
============================================
  Files            602       604     +2     
  Lines          35521     35649   +128     
  Branches        5904      5924    +20     
============================================
+ Hits           24654     24743    +89     
- Misses          8533      8555    +22     
- Partials        2334      2351    +17     
Impacted Files Coverage Δ Complexity Δ
src/main/java/htsjdk/variant/vcf/VCFHeader.java 86.842% <57.895%> (-3.274%) 82.000 <6.000> (+6.000) :arrow_down:
.../java/htsjdk/samtools/util/BufferedLineReader.java 48.387% <0.000%> (-20.034%) 7.000% <0.000%> (+1.000%) :arrow_down:
src/main/java/htsjdk/samtools/CRAMFileReader.java 74.762% <0.000%> (-0.238%) 47.000% <0.000%> (+2.000%) :arrow_down:
...jdk/samtools/util/BlockCompressedOutputStream.java 79.710% <0.000%> (ø) 34.000% <0.000%> (+2.000%)
src/main/java/htsjdk/io/Writer.java 0.000% <0.000%> (ø) 0.000% <0.000%> (?%)
src/main/java/htsjdk/io/AsyncWriterPool.java 72.222% <0.000%> (ø) 5.000% <0.000%> (?%)
src/main/java/htsjdk/io/HtsPath.java 62.195% <0.000%> (+10.879%) 20.000% <0.000%> (+5.000%)
clintval commented 3 years ago

Hey @lbergelson is this a PR your team would be willing to review? Many thanks if so!

lbergelson commented 3 years ago

@clintval Yeah, we're definitely going to get to it. Reviews have been really slow lately because working from home in the pandemic is killing my productivity.

cmnbroad commented 3 years ago

@mjhipp Thanks for taking a stab at this. While it does fix the duplicate line problem, it changes current behavior in a way that we should at least think about. Currently, any ID header line other than INFO/FORMAT (i.e., ALT, META, SAMPLE, PEDIGREE, or any other custom line) can be retrieved via getOtherHeaderLine(String key). I think with this change that will no longer be the case.

The real issue is that otherMeta probably shouldn't have any ID lines in it. Making the change in this PR and adding the new getOtherHeaderLine(String key, String id) kind of codifies that, though, and it feels like that's going in the wrong direction. An alternative fix would be to stop putting the ID lines in there at all, and store them elsewhere, and expose them through the new method. That would have the same backward compatibility issue as this fix, but it would also eliminate the need for the synthetic key names, which might be confusing since they look a lot like the ALT SV ID type hierarchy.

If we're going to break compatibility to fix this, I'd be inclined to favor doing the more complete fix though it's likely to be a bit more work and might impact other methods. Or maybe we should just resurrect the old PR and wait for that. @lbergelson any thoughts ? Having said all that, we should fix this somehow...

mjhipp commented 3 years ago

An alternative fix would be to stop putting the ID lines in there at all, and store them elsewhere, and expose them through the new method.

A nested map is one possibility for this. Where the outer map is from line type ("ALT", "PEDIGREE", etc) to an inner map, and the inner map is ID to metadata line. This would scale for any reasonable amount of line types, without having to hard code them (defining mAltMetaData map).

This would require a similar change to methods, but without synthetic key names.

Would add a new map at the top:

private final Map<String, Map<String, VCFSimpleHeaderLine>> mIdMetaData = new LinkedHashMap<String, Map<String, VCFSimpleHeaderLine>>();

Adding a new line would involve checking for the key, (making a new inner map if necessary), and adding to the inner map at the id hash.

which would change the get method I added to:

    /**
     * @param key    the header key or field type
     * @param id     the header id
     * @return the meta data line, or null if there is none
     */
    public VCFSimpleHeaderLine getIdHeaderLine(final String key, final String id) {
      return mIdMetaData.get(key).get(id);
    }

(probably on multiple lines with a null check in there)

Currently, any ID header line other than INFO/FORMAT (i.e., ALT, META, SAMPLE, PEDIGREE, or any other custom line) can be retrieved via getOtherHeaderLine(String key). I think with this change that will no longer be the case.

With the above changes, could possibly also change the existing getOtherHeaderLine to add some backwards compatibility. This could involve first checking mOtherMetaData for the key, and if not found, check mIdMetaData for that key. If the key is found in the outer map of mIdMetaData, take the first (or last) value in the inner map and cast to VCFHeaderLine.

mjhipp commented 3 years ago

I made the changes above. I think now the only backwards compatibility issue is that getOtherHeaderLines() will not return VCFSimpleHeaderLines, where it had before. To get those, you use getIdHeaderLines()

cmnbroad commented 3 years ago

After further discussing this internally and thinking more about it, I'm inclined to wait for a more complete fix to VCFHeader and address the numerous issues all at once rather than continue trying to make incremental fixes. I've resurrected and rebased my old #835 branch and will resubmit a new draft PR so we can try to find a path forward and get that merged.