samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

Proposal: handling UCSC 2bit sequences as ReferenceFile #1417

Open lindenb opened 5 years ago

lindenb commented 5 years ago

Description

This is just a proposal, please don't commit.

This PR is a proposal to add a support for the 2bit fasta sequences : https://genome.ucsc.edu/goldenpath/help/twoBit.html

I wrote

TwoBitSequenceFile: https://github.com/lindenb/htsjdk/blob/pl_2bit/src/main/java/htsjdk/samtools/reference/TwoBitSequenceFile.java implements ReferenceSequenceFile and reads 2bit files.

I modified https://github.com/lindenb/htsjdk/blob/pl_2bit/src/main/java/htsjdk/samtools/reference/ReferenceSequenceFileFactory.java#L133 to handle the new file extension

Tell me if you're interested with this commit and I'll add more tests and fix the formatting.

PS: Another way to handle those kind custom references sequences would be to move statics methods from ReferenceSequenceFileFactory to non-static and give a chance to set a default instance.

ReferenceSequenceFileFactory.setInstance(my_instance_handling_2bit);
(...)
ReferenceSequenceFileFactory.getInstance().getReferenceSequenceFile(path);
yfarjoun commented 5 years ago

should the 2bit fasta be added as a hts-spec as well?

lbergelson commented 5 years ago

@lindenb Sorry for the very slow response. I think this would be a great addition to hstjdk. We were actually reading 2-bit in gatk using the ADAM project implementation, so it would be nice to have an htsjdk built in.

I did see you have an error due to a missing file at the moment:

- testBit2File *** FAILED ***
  htsjdk.samtools.SAMException: Error opening .2bit : src/test/resources/htsjdk/samtools/reference/Homo_sapiens_assembly18.trimmed.2bit
  at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:134)
  at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:95)
  at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:83)
  at htsjdk.samtools.reference.ReferenceSequenceFileFactoryTests.testBit2File(ReferenceSequenceFileFactoryTests.java:49)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:124)
  at org.testng.internal.Invoker.invokeMethod(Invoker.java:583)
  ...
  Cause: java.io.FileNotFoundException: src/test/resources/htsjdk/samtools/reference/Homo_sapiens_assembly18.trimmed.2bit (No such file or directory)
  at java.io.RandomAccessFile.open0(Native Method)
  at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
  at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
  at htsjdk.samtools.seekablestream.SeekableFileStream.<init>(SeekableFileStream.java:47)
  at htsjdk.samtools.seekablestream.SeekableStreamFactory$DefaultSeekableStreamFactory.getStreamFor(SeekableStreamFactory.java:103)
  at htsjdk.samtools.seekablestream.SeekableStreamFactory$DefaultSeekableStreamFactory.getStreamFor(SeekableStreamFactory.java:77)
  at htsjdk.samtools.reference.TwoBitSequenceFile.<init>(TwoBitSequenceFile.java:151)
  at htsjdk.samtools.reference.TwoBitSequenceFile.<init>(TwoBitSequenceFile.java:146)
  at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:132)
  at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:95)
  ...
lindenb commented 5 years ago

@yfarjoun @lbergelson thank you for having a look at this. I'm still thinking of really integrating this in htsjdk: Reading '.2bit' files remains slow compared to plain fasta files :

e.g: I tested reading 1E6 random sequences using 3 referencesequencefile (using the same random seed).


t=23731 secondes class=class htsjdk.samtools.reference.TwoBitSequenceFile
t=253126 secondes class=class htsjdk.samtools.reference.BlockCompressedIndexedFastaSequenceFile
t=5705 secondes class=class htsjdk.samtools.reference.IndexedFastaSequenceFile 

I'm don't know well the APIs for decoding bits, may be my implementation could be improved.

Furthermore '2bit' are not standard in the HTS workflows, and it doesn't fit well with classes.

But they can be (slowly) accessed over http. Nevertheless, I will try to fix the current errors.