Closed skoren closed 4 years ago
Apologies for the late response and the sparse documentation - I'll try to fix that in the coming month.
We do use BAM files to store the raw alignments, but the files are too unwieldy for downstream processing so we subset the information in the BAM to a tabular format that only includes alignment coordinates, scores and some other information (the '.alignment.parquet' file). By removing the excess information we can reduce a ~500GB BAM to a 12GB table. If you want to re-generate the BAM file for the sample datasets we shared you can use the associated snakemake pipeline.
In general the individual pore-C tools try to consume/produce tidy tabular data where possible. Some summary metadata is stored in intake compatible yaml files, but all of the most important data are in tabular form and stored on disk using the parquet format (reasons below). The most important one is the '*.alignment.parquet' table which has the following schema:
field | example value | datatype | description |
---|---|---|---|
read_idx | 0 | int64 | unique integer id for the read |
align_idx | 0 | int64 | unique integer id for the alignment |
mapping_type | primary | categorical | primary, secondary, supplemental, unmapped |
chrom | chr6 | categorical | chromosome name |
start | 1599634 | int64 | genomic alignment start |
end | 1600965 | int64 | genomic alignment end |
strand | False | bool | alignment strand |
read_name | 0000d441-7dd8-4161-ab64-710547ce4d0a | string | read name |
read_length | 6830 | int64 | length in bases of read |
read_start | 1403 | int64 | start of alignment on read |
read_end | 2796 | int64 | end of alignment on read |
mapping_quality | 124 | uint8 | MQ score from aligner |
score | 639 | int64 | alignment score from AS tag in bam |
pass_filter | True | bool | True if alignment passes pore-C filters |
reason | pass | categorical | pass/unmapped/singleton/low_mq/overlap_on_read/not_on_shortest_path |
fragment_id | 4885554 | int64 | unique integer id of the restriction fragment assigned to this alignment |
contained_fragments | 0 | int64 | number of restriction fragments completely contained within this alignment |
fragment_start | 1599550 | int64 | genomic start site of the assigned restriction fragment |
fragment_end | 1600611 | int64 | genome end site of the assigned restriction fragment |
perc_of_alignment | 73.40345764160156 | float32 | proportion of alignment that overlaps the assigned restriction fragment |
perc_of_fragment | 92.08293914794922 | float32 | proportion of the assigned restriction fragment that overlaps the alignment |
How you read it depends on what your preferred language is but in python this is how I would load the table using pandas + pyarrow:
import pandas as pd
df = pd.read_parquet('prefix.alignment.parquet', engine="pyarrow")
There are many reasons for choosing parquet as the storage format for the tabular data:
Thanks for the reply, I was able to get the data out via the command line tools from apache and could figure out most of the columns from the schema. I think it would be nice to have a quick-start guide pointing to how you'd get info from the files just from command line not programmatically. The parquet documentation for the command line tools seems very sparse (at least on the main page). If it's useful for others, what I ended up doing was:
mvn clean package -Plocal
java -jar parquet-tools/target/parquet-tools-1.11.0.jar
command to get information from the files, some examples:
java -jar parquet-tools/target/parquet-tools-1.11.0.jar schema file.parquet
java -jar parquet-tools/target/parquet-tools-1.11.0.jar head -n 1 file.parquet
java -jar parquet-tools/target/parquet-tools-1.11.0.jar cat file.parquet
I don't see any documentation of the pore-c format used or examples to view the files. Is there a reason to use a new format rather than something commonly used like sam/bam?