iRNA-COSI / APAeval

Community effort to evaluate computational methods for the detection and quantification of poly(A) sites and estimating their differential usage across RNA-seq samples
MIT License

Adapt specs to have poly(A) sites of 1nt length #159

Closed ninsch3000 closed 3 years ago

ninsch3000 commented 3 years ago

Conclusion

Following the discussion below, we'd like to report PAS as single nucleotides, i.e. with start and end coordinates differing by 1. The execution workflow output / summary workflow input specifications have to be adjusted accordingly, and workflows have to be adapted where applicable.
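As a minimal sketch of what "start and end differing by 1" means for a BED record, the check below flags whether a line describes a single-nucleotide site. The function name `is_single_nt_site` and the example records are hypothetical and not part of the APAeval specifications; the sketch assumes tab-separated BED lines with at least three columns.

```python
def is_single_nt_site(bed_line: str) -> bool:
    """Return True if a tab-separated BED record spans exactly one nucleotide."""
    fields = bed_line.rstrip("\n").split("\t")
    chrom_start, chrom_end = int(fields[1]), int(fields[2])
    return chrom_end - chrom_start == 1


# Hypothetical example records, for illustration only.
records = [
    "chr1\t1000\t1001\tsiteA\t0\t+",     # single nucleotide: end = start + 1
    "chr1\t2000\t2000\tsiteB\t0\t-",     # zero-length record: end = start
    "chr1\t3000\t3012\tclusterC\t0\t+",  # multi-nucleotide cluster
]
flags = [is_single_nt_site(record) for record in records]
print(flags)
```

Only the first record passes; the zero-length and cluster records would need to be converted before they match the agreed convention.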

Discussion

Revisit the discussion in https://github.com/iRNA-COSI/APAeval/pull/91 as not everyone had the same opinion. Should a poly(A) site's ChromStart and ChromEnd position in a .bed file be the same (as in the specification https://github.com/iRNA-COSI/APAeval/blob/main/summary_workflows/quantification/Q2_benchmark/Q2_benchmark_specification.md), or differ by 1? What is the convention here?

mrgazzara commented 3 years ago

So I did a bit of digging on other PAS databases that report a single nucleotide PAS (as opposed to clusters like the wonderful PolyASite ;)) including PolyA_DB and APASdb. They simply provide a single coordinate (not in BED format) that corresponds to the last position on the transcript before the polyA tail is added.

To report this consistently with BED format, I agree with @mzavolan's comment last week that ChromStart and ChromEnd should differ by 1 nt, since BED uses a 0-based start and an exclusive (equivalently, 1-based inclusive) end. That is, the single nucleotide reported should be the last nucleotide of the pre-mRNA transcript just upstream of the cleavage and polyadenylation reaction.
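To illustrate the conversion described here, this sketch turns a single 1-based coordinate (as a database might list it) into a 6-column BED line with end = start + 1. The helper name `position_to_bed`, the placeholder score of 0, and the example coordinate are assumptions for illustration; whether a given database actually uses 1-based coordinates should be checked against its documentation.

```python
def position_to_bed(chrom: str, pos_1based: int, strand: str, name: str = ".") -> str:
    """Convert a 1-based single-nucleotide position to a 6-column BED line.

    BED intervals are 0-based and half-open, so the nucleotide at 1-based
    position p is written as start = p - 1, end = p.
    """
    if pos_1based < 1:
        raise ValueError("1-based positions start at 1")
    start = pos_1based - 1
    end = pos_1based
    # Score is set to 0 as a placeholder (assumption, not from the spec).
    return "\t".join([chrom, str(start), str(end), name, "0", strand])


line = position_to_bed("chr1", 1001, "+")
print(line)
```

The same arithmetic applies on both strands, because the record describes one genomic nucleotide and the strand column only records which transcript it belongs to.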

mzavolan commented 3 years ago

@ninsch3000 : although we have PAS clusters we also have a cluster representative (most frequently used position), correct?

ninsch3000 commented 3 years ago

Yes, but the representative is indicated as part of the cluster ID (in bed column 4). We don't have a bed file for the representative sites explicitly. However, if a cluster consists of only one nucleotide, we use the convention @mrgazzara described (end = start +1). That's why I brought up the whole discussion again.

@mfansler is arguing, though, that we're talking about the cleavage position, and since cleavage occurs between two nucleotides, we should use the BED convention for insertions (start = end) rather than for a single nucleotide. I think this makes sense; however, I thought the convention (if one exists) is rather to report a single nucleotide.
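The two conventions can be related with a small, strand-aware sketch. It assumes a cleavage site given as a zero-length BED position b (start = end = b, i.e. the boundary between 0-based bases b - 1 and b) and returns the 1-nt interval of the last transcript nucleotide upstream of that boundary. The function name `cleavage_to_last_nucleotide` is hypothetical.

```python
def cleavage_to_last_nucleotide(boundary: int, strand: str) -> tuple[int, int]:
    """Map a zero-length cleavage boundary to the 1-nt BED interval of the
    last nucleotide before cleavage (0-based, half-open coordinates)."""
    if strand == "+":
        if boundary < 1:
            raise ValueError("no nucleotide upstream of boundary 0 on '+'")
        # Upstream on the plus strand is the base to the left of the boundary.
        return boundary - 1, boundary
    if strand == "-":
        # Upstream on the minus strand is the base to the right of the boundary.
        return boundary, boundary + 1
    raise ValueError("strand must be '+' or '-'")


plus_site = cleavage_to_last_nucleotide(1001, "+")
minus_site = cleavage_to_last_nucleotide(1001, "-")
print(plus_site, minus_site)
```

Note that the same boundary maps to different nucleotides depending on strand, which is one reason the convention needs to be fixed explicitly.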

mzavolan commented 3 years ago

We can write a short "adaptor" script to put the polyAsite in the necessary format. I am not sure about the convention though. How is it in other poly(A) site databases?

ninsch3000 commented 3 years ago

Of course we can, @mrgazzara already has similar scripts which he used for transforming the Aseq2 ground truth cluster files to single nt sites, if I'm not mistaken. We just need to agree on a convention so we know which files to adapt.
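A rough sketch of such an adaptor is shown below; it is not @mrgazzara's script. It assumes the cluster ID in BED column 4 has the hypothetical form `chrom:representative:strand` with a 1-based representative position, which should be verified against the actual files before use. The function name `cluster_to_single_nt` and the example record are made up for illustration.

```python
def cluster_to_single_nt(bed_line: str) -> str:
    """Collapse a cluster BED record to its 1-nt representative site.

    Assumes column 4 looks like 'chrom:representative:strand' with a
    1-based representative position (an assumption, not a confirmed format).
    """
    fields = bed_line.rstrip("\n").split("\t")
    # rsplit keeps any ':' inside the chromosome name intact.
    _chrom, rep_pos, _strand = fields[3].rsplit(":", 2)
    rep = int(rep_pos)
    fields[1] = str(rep - 1)  # 0-based start of the representative nucleotide
    fields[2] = str(rep)      # exclusive end, so end = start + 1
    return "\t".join(fields)


# Hypothetical cluster record, for illustration only.
cluster = "chr1\t16430\t16450\tchr1:16442:-\t3.5\t-"
site = cluster_to_single_nt(cluster)
print(site)
```

All other columns are passed through unchanged, so the cluster ID remains available for tracing a site back to its source cluster.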

As @mrgazzara mentions above, he checked 2 DBs, which don't provide bed files. I just checked Animal ApaDB, and there they also don't provide bed files, just report clusters or a single coordinate (in .xls files...)

mzavolan commented 3 years ago

OK @ninsch3000 @mrgazzara so probably best is to report 3' nucleotides instead of cleavage sites and then they will be 1 nucleotide in length.