nanoporetech / pore-c

Pore-C support
Mozilla Public License 2.0
52 stars 5 forks source link

parquet format #33

Closed skoren closed 4 years ago

skoren commented 4 years ago

I don't see any documentation of the pore-c format used or examples to view the files. Is there a reason to use a new format rather than something commonly used like sam/bam?

eharr commented 4 years ago

Apologies for the late response and the sparse documentation - I'll try to fix that in the coming month.

We do use BAM files to store the raw alignments, but the files are too unwieldy for downstream processing so we subset the information in the BAM to a tabular format that only includes alignment coordinates, scores and some other information (the '.alignment.parquet' file). By removing the excess information we can reduce a ~500GB BAM to a 12GB table. If you want to re-generate the BAM file for the sample datasets we shared you can use the associated snakemake pipeline.

In general the individual pore-C tools try to consume/produce tidy tabular data where possible. Some summary metadata is stored in intake compatible yaml files, but all of the most important data are in tabular form and stored on disk using the parquet format (reasons below). The most important one is the '*.alignment.parquet' table which has the following schema:

field example value datatype description
read_idx 0 int64 unique integer id for the read
align_idx 0 int64 unique integer id for the alignment
mapping_type primary categorical primary, secondary, supplemental, unmapped
chrom chr6 categorical chromosome name
start 1599634 int64 genomic alignment start
end 1600965 int64 genomic alignment end
strand False bool alignment strand
read_name 0000d441-7dd8-4161-ab64-710547ce4d0a string read name
read_length 6830 int64 length in bases of read
read_start 1403 int64 start of alignment on read
read_end 2796 int64 end of alignment on read
mapping_quality 124 uint8 MQ score from aligner
score 639 int64 alignment score from AS tag in bam
pass_filter True bool True if alignment passes pore-C filters
reason pass categorical pass/unmapped/singleton/low_mq/overlap_on_read/not_on_shortest_path
fragment_id 4885554 int64 unique integer id of the restriction fragment assigned to this alignment
contained_fragments 0 int64 number of restriction fragments completely contained within this alignment
fragment_start 1599550 int64 genomic start site of the assigned restriction fragment
fragment_end 1600611 int64 genome end site of the assigned restriction fragment
perc_of_alignment 73.40345764160156 float32 proportion of alignment that overlaps the assigned restriction fragment
perc_of_fragment 92.08293914794922 float32 proportion of the assigned restriction fragment that overlaps the alignment

How you read it depends on what your preferred language is but in python this is how I would load the table using pandas + pyarrow:

import pandas as pd

df  = pd.read_parquet('prefix.alignment.parquet', engine="pyarrow")

There are many reasons for choosing parquet as the storage format for the tabular data:

skoren commented 4 years ago

Thanks for the reply, I was able to get the data out via the command line tools from apache and could figure out most of the columns from the schema. I think it would be nice to have a quick-start guide pointing to how you'd get info from the files just from command line not programmatically. The parquet documentation for the command line tools seems very sparse (at least on the main page). If it's useful for others, what I ended up doing was: