parquet format - Githubissues

skoren commented 4 years ago

I don't see any documentation of the pore-c format used or examples to view the files. Is there a reason to use a new format rather than something commonly used like sam/bam?

eharr commented 4 years ago

Apologies for the late response and the sparse documentation - I'll try to fix that in the coming month.

We do use BAM files to store the raw alignments, but the files are too unwieldy for downstream processing so we subset the information in the BAM to a tabular format that only includes alignment coordinates, scores and some other information (the '.alignment.parquet' file). By removing the excess information we can reduce a ~500GB BAM to a 12GB table. If you want to re-generate the BAM file for the sample datasets we shared you can use the associated snakemake pipeline.

In general the individual pore-C tools try to consume/produce tidy tabular data where possible. Some summary metadata is stored in intake compatible yaml files, but all of the most important data are in tabular form and stored on disk using the parquet format (reasons below). The most important one is the '*.alignment.parquet' table which has the following schema:

field	example value	datatype	description
read_idx	0	int64	unique integer id for the read
align_idx	0	int64	unique integer id for the alignment
mapping_type	primary	categorical	primary, secondary, supplemental, unmapped
chrom	chr6	categorical	chromosome name
start	1599634	int64	genomic alignment start
end	1600965	int64	genomic alignment end
strand	False	bool	alignment strand
read_name	0000d441-7dd8-4161-ab64-710547ce4d0a	string	read name
read_length	6830	int64	length in bases of read
read_start	1403	int64	start of alignment on read
read_end	2796	int64	end of alignment on read
mapping_quality	124	uint8	MQ score from aligner
score	639	int64	alignment score from AS tag in bam
pass_filter	True	bool	True if alignment passes pore-C filters
reason	pass	categorical	pass/unmapped/singleton/low_mq/overlap_on_read/not_on_shortest_path
fragment_id	4885554	int64	unique integer id of the restriction fragment assigned to this alignment
contained_fragments	0	int64	number of restriction fragments completely contained within this alignment
fragment_start	1599550	int64	genomic start site of the assigned restriction fragment
fragment_end	1600611	int64	genome end site of the assigned restriction fragment
perc_of_alignment	73.40345764160156	float32	proportion of alignment that overlaps the assigned restriction fragment
perc_of_fragment	92.08293914794922	float32	proportion of the assigned restriction fragment that overlaps the alignment

How you read it depends on what your preferred language is but in python this is how I would load the table using pandas + pyarrow:

import pandas as pd

df  = pd.read_parquet('prefix.alignment.parquet', engine="pyarrow")

There are many reasons for choosing parquet as the storage format for the tabular data:

good language support (python, R, c, C++ and more) via the apache arrow project
self-describing (poor excuse for lack of documentation)
columnar storage, so only read what you need
built-in compression
row-wise partitioning of the data (similar to tabix, but doesn't require coordinate sorting)
integrated with the dask project which allows multi-thread/process/node execution in python

skoren commented 4 years ago

Thanks for the reply, I was able to get the data out via the command line tools from apache and could figure out most of the columns from the schema. I think it would be nice to have a quick-start guide pointing to how you'd get info from the files just from command line not programmatically. The parquet documentation for the command line tools seems very sparse (at least on the main page). If it's useful for others, what I ended up doing was:

Download from https://github.com/apache/parquet-mr/archive/apache-parquet-1.11.0.tar.gz
Install using mvn clean package -Plocal

Use the java -jar parquet-tools/target/parquet-tools-1.11.0.jar command to get information from the files, some examples:

java -jar parquet-tools/target/parquet-tools-1.11.0.jar schema file.parquet
java -jar parquet-tools/target/parquet-tools-1.11.0.jar head -n 1 file.parquet
java -jar parquet-tools/target/parquet-tools-1.11.0.jar cat file.parquet

nanoporetech / pore-c

parquet format #33