hoffmangroup / acidbio


Test data FASTA coordinates inconsistent with BED coordinates #8

Closed: mdshw5 closed this issue 1 year ago

mdshw5 commented 2 years ago

First off - thanks for publishing this work and making the test suite available. I think this type of project is really healthy for the bioinformatics community. I'm trying to understand the relationship between your test "toy" FASTA coordinates, and the requested coordinates in the example BED files. Based on the FASTA index it appears that the chromosome fragments are 255000 characters long, and so the FASTA sequence coordinates would be chr19:1-255001 and chr20:1-255001:

https://github.com/hoffmangroup/acidbio/blob/a7ddfd8ab13c40412278ef9a0f202a472c85a9b8/bed/data/toy.fa.fai#L1-L2

However, when I look at the "good" BED03 files, I see that many regions lie outside these coordinates. For instance:

https://github.com/hoffmangroup/acidbio/blob/a7ddfd8ab13c40412278ef9a0f202a472c85a9b8/bed/BED03/good/other-comment_start.bed#L76-L78

In the above example the regions start within the valid FASTA sequence coordinates, but quickly exceed the end of the FASTA record. I've checked this against the Zenodo archive and it seems like the manuscript was generated using the same data, so I'm at a loss to explain how this should work. Maybe I'm missing something?
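
For anyone who wants to reproduce this comparison, here is a minimal sketch of the check, assuming the standard five-column `.fai` layout (name, length, offset, line bases, line width) and BED's 0-based, end-exclusive intervals. The function names `read_fai_lengths` and `out_of_range_intervals` are hypothetical, and the sample values are illustrative only, not the real contents of toy.fa.fai or of any BED file in the repository.

```python
def read_fai_lengths(fai_text):
    """Map each sequence name to its length, taken from columns 1-2 of .fai text."""
    lengths = {}
    for line in fai_text.splitlines():
        if not line.strip():
            continue
        name, length = line.split("\t")[:2]
        lengths[name] = int(length)
    return lengths


def out_of_range_intervals(bed_text, lengths):
    """Return (chrom, start, end) for BED intervals that end past the sequence end."""
    bad = []
    for line in bed_text.splitlines():
        fields = line.split()
        # Skip blank lines, comments, and track/browser header lines.
        if not fields or line.startswith("#") or fields[0] in ("track", "browser"):
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # BED ends are exclusive, so end == length is still in range.
        if chrom in lengths and end > lengths[chrom]:
            bad.append((chrom, start, end))
    return bad


# Illustrative values only (hypothetical index and BED records).
fai_text = "chr19\t255000\t7\t60\t61\nchr20\t255000\t259264\t60\t61\n"
bed_text = "chr19\t100\t200\nchr19\t254900\t255100\nchr20\t300000\t300500\n"

lengths = read_fai_lengths(fai_text)
for chrom, start, end in out_of_range_intervals(bed_text, lengths):
    print(f"{chrom}:{start}-{end} extends past length {lengths[chrom]}")
```

With the illustrative input, the second and third records are reported because their end coordinates exceed the indexed length. Records on sequences absent from the index are ignored here, since that is a separate problem.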

mdshw5 commented 2 years ago

I'm also curious how the known and unknown scaffolds are supposed to be part of the "good" (expected pass) test scenario, since these FASTA records are missing from the toy example:

https://github.com/hoffmangroup/acidbio/blob/a7ddfd8ab13c40412278ef9a0f202a472c85a9b8/bed/BED03/good/01-known_scaffolds.bed#L1-L3

https://github.com/hoffmangroup/acidbio/blob/a7ddfd8ab13c40412278ef9a0f202a472c85a9b8/bed/BED03/good/01-unknown_scaffolds.bed#L1-L3
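
A quick way to list which BED sequence names have no matching FASTA record might look like the sketch below. The helper name `missing_chromosomes` is hypothetical, and the scaffold names in the sample are illustrative hg38-style names rather than lines copied from the linked files.

```python
def missing_chromosomes(bed_text, fasta_names):
    """Return the sorted sequence names used in BED records that are not in the FASTA."""
    used = set()
    for line in bed_text.splitlines():
        fields = line.split()
        # Skip blank lines, comments, and track/browser header lines.
        if not fields or line.startswith("#") or fields[0] in ("track", "browser"):
            continue
        used.add(fields[0])
    return sorted(used - set(fasta_names))


# Illustrative values only (hypothetical BED records and FASTA names).
bed_text = (
    "chr19\t10\t20\n"
    "chr1_KI270706v1_random\t0\t100\n"
    "chrUn_KI270302v1\t0\t50\n"
)
fasta_names = ["chr19", "chr20"]

print(missing_chromosomes(bed_text, fasta_names))
```

Any name this returns would be a region that a tool cannot resolve against the FASTA, whatever the coordinates are.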

niujeffrey commented 2 years ago

Thanks for your comments, Matt.

Regarding the toy.fa coordinates, there are indeed regions in the BED files that fall outside the toy.fa sequences. This unfortunately slipped by because it does not seem to affect the performance of any of the tools we tested that use the toy.fa file. When I extended the chr19 and chr20 sequences in toy.fa, the tools gave the same results.

For the scaffolds, however, adding the scaffold sequences to toy.fa caused an error in one tool because not all sequences in the FASTA were present in the BED file. I think this is a limitation of this testing framework: it is very difficult to capture all possible uses of these tools in a single test file.

Instead of using toy.fa when testing your own tools, it may be more useful to either use hg38.fa or another FASTA file that best matches the intended use.

I will lengthen the sequences in toy.fa soon.

mdshw5 commented 2 years ago

Thanks for the response. Indeed, it's going to be tough to determine how tools should validate BED/FASTA pairings, but it is somewhat concerning that these tools did not raise an exception when coordinates outside of toy.fa were requested.
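
To illustrate what raising an exception could look like, here is a minimal sketch in Python. It assumes sequences are held in memory as a plain dict of strings, and `fetch_strict` is a hypothetical helper, not the behaviour of any tool tested in the manuscript. Plain Python slicing silently truncates (`"ACGT"[2:10]` returns `"GT"`), so the bounds have to be checked explicitly.

```python
def fetch_strict(sequences, chrom, start, end):
    """Return sequences[chrom][start:end] for a BED-style interval
    (0-based start, exclusive end), raising instead of truncating."""
    if chrom not in sequences:
        raise KeyError(f"{chrom} is not present in the FASTA")
    seq = sequences[chrom]
    if not 0 <= start <= end <= len(seq):
        raise IndexError(
            f"{chrom}:{start}-{end} is outside the sequence (length {len(seq)})"
        )
    return seq[start:end]


# Illustrative values only (a hypothetical 8-base record).
sequences = {"chr19": "ACGTACGT"}

print(fetch_strict(sequences, "chr19", 2, 6))
try:
    fetch_strict(sequences, "chr19", 4, 20)
except IndexError as err:
    print("rejected:", err)
```

Whether a tool should fail hard like this or only warn is a design choice; the sketch shows the stricter option.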