v11: Chunk Fasta files in parallel with components

graph-genome / component_segmentation

Read in ODGI Bin output and identify co-linear components

Apache License 2.0


v11: Chunk Fasta files in parallel with components #11

Closed josiahseaman closed 4 years ago

josiahseaman commented 4 years ago

Depends on: (https://github.com/vgteam/odgi/issues/88). component_segmentation reads in a single FASTA file. It chunks up the FASTA in parallel with the component dividers and assigns corresponding names for (https://github.com/graph-genome/Schematize/issues/17).

josiahseaman commented 4 years ago

Start with FASTA mockup file.
Get sequence out of the bin file, from fresh odgi

partitions, bin2file_mapping = schematic.split(args.cells_per_file) is a good place to start.

Final output: FASTA files that correlate to chunk boundaries, one FASTA for every chunk.json.
The start and stop bin should be in the FASTA header, and the filename should follow the pattern chunk05.fasta.
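
A minimal sketch of that output step, under stated assumptions: the function name `write_chunk_fastas`, the `first_bin=`/`last_bin=` header layout, and the zero-based, two-digit chunk numbering are all hypothetical and not taken from component_segmentation. It assumes bins are labeled from 1 and each chunk is given as an inclusive (first_bin, last_bin) pair.

```python
from pathlib import Path
from typing import List, Tuple


def write_chunk_fastas(
    sequence: str,
    chunk_bins: List[Tuple[int, int]],
    bin_width: int,
    out_dir: Path,
) -> List[Path]:
    """Write one FASTA file per chunk and return the paths written.

    chunk_bins holds 1-based, inclusive (first_bin, last_bin) pairs,
    one pair per chunk. The header format is an assumption.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, (first_bin, last_bin) in enumerate(chunk_bins):
        # Bin labels start at 1, so bin Y covers [(Y - 1) * width : Y * width].
        start = (first_bin - 1) * bin_width
        end = last_bin * bin_width
        path = out_dir / f"chunk{index:02d}.fasta"
        header = f">first_bin={first_bin} last_bin={last_bin}"
        # Slicing past the end of the sequence simply yields a shorter last chunk.
        path.write_text(f"{header}\n{sequence[start:end]}\n")
        written.append(path)
    return written
```

Calling it with the same bin ranges that went into each chunk.json would keep the FASTA files and the JSON chunks aligned.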

josiahseaman commented 4 years ago

@mandosoft Thomas is working on this issue.

josiahseaman commented 4 years ago

Bin math: Bin 0 is reserved. Biologists count nucleotide index starting at 1, not 0. That means with bin width 1,000, positions 1 to 1,000 are in bin 1. Target position X is in bin label ceil(X / 1,000). The FASTA position in the file (0-indexed) for bin Y = [(Y - 1) * 1,000 : Y * 1,000], non-inclusive of the last index (default slice behavior in Python and JavaScript). Nucleotide index Z for the pangenome is Fasta[Z - 1].
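
The bin math above can be sketched in Python as follows. The function names and the `BIN_WIDTH` constant are made up for illustration and are not part of component_segmentation.

```python
BIN_WIDTH = 1000  # example bin width from the comment above


def bin_for_position(position: int, bin_width: int = BIN_WIDTH) -> int:
    """Return the 1-based bin label for a 1-based nucleotide position."""
    if position < 1:
        raise ValueError("nucleotide positions start at 1")
    # Integer form of ceil(position / bin_width) for position >= 1.
    return (position - 1) // bin_width + 1


def bin_slice(bin_label: int, bin_width: int = BIN_WIDTH) -> slice:
    """Return the 0-indexed, end-exclusive slice of the FASTA string for a bin."""
    if bin_label < 1:
        raise ValueError("bin 0 is reserved; bin labels start at 1")
    return slice((bin_label - 1) * bin_width, bin_label * bin_width)


def nucleotide_at(fasta: str, position: int) -> str:
    """Return the nucleotide at 1-based position Z, i.e. fasta[Z - 1]."""
    if position < 1:
        raise ValueError("nucleotide positions start at 1")
    return fasta[position - 1]
```

For example, `bin_for_position(1000)` gives 1 and `bin_for_position(1001)` gives 2, while `fasta[bin_slice(2)]` selects 0-based indices 1000 through 1999.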