ucscGenomeBrowser / kent

UCSC Genome Browser source tree. Stable branch: "beta".
http://genome.ucsc.edu/

Question about twoBit.c / java #22

Closed lindenb closed 5 years ago

lindenb commented 5 years ago

Hi UCSC team,

I'm currently writing a PR for the htsjdk project, the "Java API for high-throughput sequencing data (HTS) formats".

The goal of my PR https://github.com/samtools/htsjdk/pull/1417 is to add Java code that handles the '.2bit' format. My Java code is largely inspired by your C code, twoBit.c.

1) Are you OK with my including this code in the htsjdk project? Should I add any specific license (currently MIT) or any author to my code?

2) A technical question: I need to build a SequenceDictionary in which the contigs appear in the same order as in the input fasta. When faToTwoBit builds a '.2bit' file, is the order of the sequences in the '.2bit' file always the same as in the original fasta file (as seen when reading at this position: https://github.com/ucscGenomeBrowser/kent/blob/master/src/lib/twoBit.c#L658 ), or is there any re-ordering by a hash table?

Thank you,

Pierre

NullModel commented 5 years ago

Good Morning Pierre:

Your proposed license is fine. Yes, you can use your code elsewhere. Yes, it appears that the order you put fasta into the 2bit is the order you will get out if you simply read it all:

    faCount sequences.fa
    faToTwoBit sequences.fa test.2bit
    twoBitToFa test.2bit stdout | faCount stdin

Both faCount runs produce the same output.
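For a Java reader, one way to keep that order is to walk the index once, front to back, and append each name to a list rather than a hash map. The following is only a sketch, not the kent or htsjdk implementation. It assumes the layout given in the published .2bit format description (four 32-bit header words: signature, version, sequence count, reserved; then one index entry per sequence holding a 1-byte name length, the name, and a 32-bit offset), and it assumes the caller has already set the buffer's byte order. The names `TwoBitIndexOrder`, `readNames` and `sample` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class TwoBitIndexOrder {
    /** Reads the sequence names from a .2bit header and index, in stored order. */
    static List<String> readNames(ByteBuffer buf) {
        buf.position(8);                  // skip signature and version
        int count = buf.getInt();         // number of sequences
        buf.getInt();                     // reserved word, ignored
        List<String> names = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            int nameSize = buf.get() & 0xFF;   // unsigned 1-byte length
            byte[] name = new byte[nameSize];
            buf.get(name);
            buf.getInt();                 // offset of the sequence record, unused here
            names.add(new String(name, StandardCharsets.US_ASCII));
        }
        return names;                     // a List keeps the file order
    }

    /** Builds a tiny little-endian header + index in memory (hypothetical test data). */
    static ByteBuffer sample(String... names) {
        int size = 16;
        for (String n : names) size += 1 + n.length() + 4;
        ByteBuffer buf = ByteBuffer.allocate(size).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(0x1A412743).putInt(0).putInt(names.length).putInt(0);
        for (String n : names) {
            buf.put((byte) n.length());
            buf.put(n.getBytes(StandardCharsets.US_ASCII));
            buf.putInt(0);                // dummy offset
        }
        buf.flip();
        return buf;
    }

    public static void main(String[] args) {
        System.out.println(readNames(sample("chr2", "chr1", "chrM")));
    }
}
```

The resulting list can then be turned into a sequence dictionary in the same order; how that is done in htsjdk is outside this sketch.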

Please be aware of the byte swapping issue. The kent C code will write out files in the native byte order of the machine it is running on. There is a tag in the file to indicate the byte order so that the reader can adjust to any file encountered regardless of where it was produced. If you are also creating a writer function, it should tag the file appropriately.
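On the Java side, the byte order check might look like the sketch below. It assumes the tag is the 32-bit signature 0x1A412743 at the start of the file, as given in the published .2bit format description: the reader tries the four bytes in both byte orders and keeps whichever yields the signature. The class and method names (`TwoBitByteOrder`, `detect`) are hypothetical, and this is not the kent or htsjdk code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class TwoBitByteOrder {
    /** Signature expected in the first four bytes of a .2bit file (assumed from the format description). */
    static final int SIGNATURE = 0x1A412743;

    /** Returns the byte order of a .2bit file, given its first four bytes. */
    static ByteOrder detect(byte[] firstFour) {
        ByteBuffer buf = ByteBuffer.wrap(firstFour, 0, 4);
        ByteOrder[] candidates = {ByteOrder.LITTLE_ENDIAN, ByteOrder.BIG_ENDIAN};
        for (ByteOrder order : candidates) {
            // Interpret the same four bytes in each order and compare with the signature.
            if (buf.order(order).getInt(0) == SIGNATURE) {
                return order;
            }
        }
        throw new IllegalArgumentException("not a .2bit file: bad signature");
    }

    public static void main(String[] args) {
        byte[] little = {0x43, 0x27, 0x41, 0x1A}; // signature as written on a little-endian machine
        byte[] big = {0x1A, 0x41, 0x27, 0x43};    // signature as written on a big-endian machine
        System.out.println(detect(little));
        System.out.println(detect(big));
    }
}
```

Once the order is known, setting it on the `ByteBuffer` used for the rest of the file lets every later `getInt` call decode correctly. A writer would do the reverse: choose an order, set it on the buffer, and write the signature first with `putInt`.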

lindenb commented 5 years ago

many thanks for your answer