Why does the bedgraph output have 5 columns?

nanoporetech / modkit

A bioinformatics tool for working with modified bases

https://nanoporetech.com/

Other

139 stars 7 forks source link

Why does the bedgraph output have 5 columns? #158

Open IanCodes opened 6 months ago

IanCodes commented 6 months ago

Hello,

I am new to modkit, so I may be asking silly questions. The bedgraph generated using modkit pileup --bedgraph generates bedgraphs with five columns. The format description at UCSC describes four columns.

My understanding from another post is that currently column 4 is fraction modified, and column five is score. Is that correct?

The result is that the bedgraph files cannot be converted to bigWig using UCSC bedGraphToBigWig as there are too many columns. Would it make sense to only use columns 1, 2, 3, and 5?

ArtRand commented 6 months ago

Hello @IanCodes,

I would use columns 1, 2, 3, and 4. Described below:

column	name	description	type
1	chrom	name of reference sequence from BAM header	str
2	start position	0-based start position	int
3	end position	0-based exclusive end position	int
4	fraction modified	N_mod / N_{valid_cov}	float
5	N_{valid_cov}	See definitions above.	int

I added the N_valid_cov column so that you could filter out rows with low (or high) valid coverage, for example you could have 1 read that's modified and that will appear as 100% (which is true) but you may not want that single site in your analysis or visualization. In the viewers I use most often you can have additional columns for in bedGraph.

To make the data work with bedGraphToBigWig use the following (for example):

awk -v min_cov=${min_cov} '$5>min_cov{print $1, $2, $3, $4}' ${modkit_bedgraph_file}

Hope this helps, happy to answer any additional questions.

IanCodes commented 6 months ago

@ArtRand thank you for your helpful answer! I usually use UCSC that has the four column specification for generating bigWig. I did see a useful thread about directly viewing bedmethyl in IGV (https://github.com/nanoporetech/modkit/issues/43), so maybe that is the best route?

The --bedgraph option outputs a file for + and - strands for each modification, is it possible (or even make sense) to have a single combined strand bedgraph file per modifcation?

Thank you.

ArtRand commented 6 months ago

Hello @IanCodes,

You can certainly combine them. The bedGraph format doesn't have a nice way to represent different modifications, however, so you might want to combine them separately.

cat h*.bedgraph | sort -k 1,2n |cut -f 1-4 > h_strand_combined.bedgraph

Ought to do it. The cut is optional to remove the valid coverage column, you could also filter instead.

lkwhite commented 3 months ago

Thanks for all the work developing this. Would be great if this description of the column outputs were added to the documentation for bedgraph pileup here.

ArtRand commented 3 months ago

Hello @lkwhite,

To be honest, I'm planning on deprecating the --bedgraph flag since you can generate the bedGraph from the bedMethyl, but having an un-documented output format is poor form regardless.