nanoporetech / Pore-C-Snakemake

Other
33 stars 14 forks source link

mapq filter #12

Open biofilos opened 3 years ago

biofilos commented 3 years ago

Hello. (edited for clarity) I was wondering if it is possible to specify the mapq threshold for read filtering in pore-c-snakemake. Now, I have to calculate all those stats (contacts per GB, cis/trans, etc) after the hic file is generated via juicer (where I specify the mapq filter), but for benchmarking and optimization purposes, it would be very useful if the stats reported by pore-c-snakemake reflect those filters. Thank you

eharr commented 3 years ago

Hi @biofilos,

I'm sorry for the delayed response to this, I had to make some changes to the way the contacts are stored to enable this. If you re-run your samples through the latest version of the pipeline there's a table in parquet format with a .contacts.parquet extension. Each row in the table is a contact and the map quality for each "side" of the contact is stored. You can analyse this table directly in R/pandas/julia to see how many cis/trans contacts you get You can also add --min-mapping-qual to the pore_c contacts export command to apply this filter at the time of contact map creation.

Eoghan

biofilos commented 3 years ago

Thanks, I am running the new pipeline now. I was a bit confused about the output. I can't see a .contacts.parquet file, but a .contacts.parquet directory inside merged_contacts. I am new to parquet, so I was wondering if I could just read that directory. I tried with pandas.read_parquet, but got a "NotImplementedError: Encoding 8`. I got the same error if I tried to read other parquet files from the pore-c pipeline. How do you guys read parquet files from Python? I got fastparquet, pyarrow, and python-snappy. Do you have any other recommendation?

eharr commented 3 years ago

Yes you should be able to read that directory and others using pandas + pyarrow. I haven't seen that error before which makes me suspect that it might be something to do with the versions. If you install pore-c tools from bioconda it should pull the correct versions of the libraries - you should then be able to use df = pd.read_parquet("sample.contacts.parquet", engine="pyarrow").

There's also a utility in pore_c tools to convert the parquet files to csv: pore_c utils parquet_to_csv <path_to_parquet> <path_to_csv>