kapsakcj / nanoporeWorkflow

:dna: Shell scripts for working with bacterial isolate Nanopore sequence data on CDC servers
MIT License

remove git-lfs requirement & alter test data format #21

Open kapsakcj opened 4 years ago

kapsakcj commented 4 years ago

Currently, simply downloading the scripts with git clone results in errors due to git-lfs. This is frustrating, but I think it is important to keep test data in the repo somehow.

One possible workaround is to split the 3 multi-read fast5 files (each ~240-362 MB in size) into single-read fast5 files. Right now there are a total of 9576 reads within these 3 fast5s, and we may be able to split them so that there is 1 fast5 for every read. File sizes would be MUCH smaller and would likely avoid having to use git-lfs.
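As a rough sanity check on "MUCH smaller", the average size per read can be estimated from the numbers above. This is only a back-of-the-envelope sketch: it assumes reads are spread evenly and ignores the per-file HDF5 overhead each single-read fast5 would carry, so real files will probably be somewhat larger.

```shell
# Rough per-read size estimate, using only the figures quoted in this issue.
n_files=3
n_reads=9576
min_mb=240   # approximate size of the smallest multi-read fast5
max_mb=362   # approximate size of the largest multi-read fast5

# Integer arithmetic in KB (1 MB taken as 1000 KB for simplicity).
low_kb=$(( n_files * min_mb * 1000 / n_reads ))
high_kb=$(( n_files * max_mb * 1000 / n_reads ))

echo "average single-read fast5: roughly ${low_kb} to ${high_kb} KB"
```

Either bound is far below a 100 MB per-file limit, which is the point of the workaround.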

I think the GitHub limit is 100 MB per file and 100 GB per repository.
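To see which files in a working copy would trip a per-file limit, something like the following could be used. It is a minimal sketch: `find_oversized_files` is a hypothetical helper name (not part of this repo), the 100 MiB default simply mirrors the limit guessed above, and it relies on the `M` size suffix supported by GNU and BSD `find`.

```shell
# List files under a directory that are larger than a size limit,
# skipping Git's own .git directory.
find_oversized_files() {
  # $1: directory to scan (default: current directory)
  # $2: size limit in MiB (default: 100)
  find "${1:-.}" -type f ! -path '*/.git/*' -size +"${2:-100}M"
}
```

For example, `find_oversized_files t/data` should list the multi-read fast5s today and print nothing once only single-read files remain.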

SciComp has a module fast5/2.0.1 for running ont_fast5_api which should allow us to split the fast5s. https://github.com/nanoporetech/ont_fast5_api#multi_to_single_fast5

kapsakcj commented 4 years ago

It was pretty easy to split the batch (multi-read) fast5s into single-read fast5s:

ml fast5/2.0.1
cd github/nanoporeWorkflow/t/data
multi_to_single_fast5 --input_path SalmonellaLitchfield.FAST5 --save_path single-read-fast5s/ -t 16

Haven't committed/pushed these files to GitHub yet, still in progress on M3.
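Before committing, it may be worth confirming that the split produced one file per read. The sketch below counts `.fast5` files recursively (in case the output is nested in subdirectories) and compares the count to an expected number. `count_fast5` and `check_split` are hypothetical helper names, and the expected value of 9576 assumes all reads counted earlier in this issue live under the input path used above.

```shell
# Count .fast5 files anywhere under a directory.
count_fast5() {
  # $1: directory that received the split output
  find "$1" -type f -name '*.fast5' | wc -l | tr -d '[:space:]'
}

# Compare the number of .fast5 files found with an expected read count.
check_split() {
  # $1: output directory, $2: expected number of reads
  actual=$(count_fast5 "$1")
  if [ "$actual" -eq "$2" ]; then
    echo "OK: $actual single-read fast5 files"
  else
    echo "MISMATCH: expected $2, found $actual" >&2
    return 1
  fi
}
```

Usage would be along the lines of `check_split single-read-fast5s/ 9576`, run from the same directory as the `multi_to_single_fast5` command.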

kapsakcj commented 3 years ago

This commit added the single-read fast5s. Still need to remove multi-read fast5s and overhaul TravisCI tests. https://github.com/kapsakcj/nanoporeWorkflow/commit/c7def8f5d3bbd2056ed67109eb225e72cb646e5a