jkbonfield / io_lib

Staden Package "io_lib" (sometimes referred to as libstaden-read by distributions). This contains code for reading and writing a variety of Bioinformatics / DNA Sequence formats.

fread() on reference file #34

Closed by esamorodnitsky 4 years ago

esamorodnitsky commented 4 years ago

Hi, can you help me? I have no idea what I am doing wrong! I am trying to run the following command:

scramble in.bam -O CRAM -r our_labs_hg19.fasta > out.cram

and get the error below. The reference file does exist, so I can't imagine why I'm getting this:

fread() on reference file: No such file or directory

esamorodnitsky commented 4 years ago

Wait, never mind, I figured out the issue

esamorodnitsky commented 4 years ago

Apparently, I had accidentally erased our lab's genome file. So, I will get it restored. Sorry about this.
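A check along the following lines, run before the conversion, could surface a missing reference file earlier. It is only a sketch: the function name `check_reference` is hypothetical, it is not part of io_lib or scramble, and it tests nothing beyond existence, readability and non-zero size.

```python
import os


def check_reference(path):
    """Return a list of problems found with a reference FASTA path.

    Hypothetical helper, not part of io_lib. An empty list means the
    file exists, is readable and is not empty; it says nothing about
    whether the contents are a valid FASTA file.
    """
    problems = []
    if not os.path.isfile(path):
        problems.append(f"reference file not found: {path}")
    elif not os.access(path, os.R_OK):
        problems.append(f"reference file is not readable: {path}")
    elif os.path.getsize(path) == 0:
        problems.append(f"reference file is empty: {path}")
    return problems


# Example use (the file name is the one from the command above):
#   for problem in check_reference("our_labs_hg19.fasta"):
#       print(problem)
```

If the returned list is non-empty, there is no point in starting scramble with that path as the -r argument.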

jkbonfield commented 4 years ago

No problem and thanks for responding and closing the issue promptly.

Note that it's possible to embed the genome reference in the CRAM file itself with "scramble -e", although this is rarely used. It does bloat the file a bit, but the overhead isn't significant if the data is deep enough.

esamorodnitsky commented 4 years ago

Actually, I wanted to ask you, what does it mean to "embed" the reference genome exactly? What is the difference between -e and -r?

jkbonfield commented 4 years ago

The -e option stores, in each slice, a copy of the portion of the reference sequence that the slice uses. So if a slice covers chr10:100000-101000, it will have about 1 kb of reference in it.
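The arithmetic in that example can be sketched as below. The function name is hypothetical, and the code assumes the region string uses 1-based inclusive coordinates, which the thread does not state; under that assumption the example region spans 1001 bases, i.e. roughly 1 kb.

```python
def embedded_reference_length(region):
    """Number of reference bases spanned by a region like 'chr10:100000-101000'.

    Assumption: coordinates are 1-based and inclusive at both ends.
    """
    # Split on the last ':' so that reference names containing ':' still work.
    name, _, span = region.rpartition(":")
    start_text, _, end_text = span.partition("-")
    start, end = int(start_text), int(end_text)
    if not name or start < 1 or end < start:
        raise ValueError(f"malformed region: {region!r}")
    return end - start + 1


print(embedded_reference_length("chr10:100000-101000"))
```

The per-slice cost therefore depends on how much of the reference a slice spans, not on the size of the whole genome.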

Obviously this doesn't work on data that isn't sorted by chromosome and position, and it becomes inefficient on highly fragmented assemblies, as only one embedded reference is permitted per slice. (This is an obvious weakness of the CRAM specification.)
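A toy model can illustrate why the one-reference-per-slice limit hurts. It is not how scramble actually packs records: the function name and the `max_records_per_slice` parameter are assumptions made for illustration, and the model simply starts a new slice whenever the reference name changes or the current slice is full.

```python
def count_slices(ref_names, max_records_per_slice=10000):
    """Count slices needed when each slice may hold only one reference.

    Toy model only: `ref_names` is the reference name of each record in
    file order, and `max_records_per_slice` is a made-up capacity.
    """
    slices = 0
    current_ref = None
    records_in_slice = 0
    for name in ref_names:
        # A change of reference, or a full slice, forces a new slice.
        if name != current_ref or records_in_slice >= max_records_per_slice:
            slices += 1
            current_ref = name
            records_in_slice = 0
        records_in_slice += 1
    return slices


sorted_reads = ["chr1"] * 4 + ["chr2"] * 4
unsorted_reads = ["chr1", "chr2"] * 4
fragmented = [f"contig{i}" for i in range(1000)]

print(count_slices(sorted_reads))    # one slice per chromosome
print(count_slices(unsorted_reads))  # a new slice on every record
print(count_slices(fragmented))      # one tiny slice per contig
```

In this model, sorted input needs one slice per chromosome, while unsorted input or an assembly of many small contigs degenerates into one slice per record or per contig, each carrying its own piece of reference.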