refresh-bio / agc

Assembled Genomes Compressor
MIT License
152 stars 13 forks source link

Python library: Using GetCtgSeq() returns sequences with numerous leading repetitions of "\x00" #15

Closed patrick-koenig closed 5 months ago

patrick-koenig commented 5 months ago

Hello,

many thanks for this very useful tool and the Python bindings!

While integrating an AGC archive into a web service, I discovered the following behaviour of the GetCtgSeq() method:

seq = agc_arch.GetCtgSeq('Morex', 'chr1H', 400000000, 400000099)
print(seq)

Output:

'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00GAACAGATCCACACCGCAAACGGAGCAGGTATGAGGATTCAGCGTGTTGGTCATTCAGTTATAATAACCCCTCATCAAAAATTCCATCTTAGAAATATTC'

As one can see the sequence itself is prefixed with numerous repetitions of "\x00".

One can easily remove them by using seq.lstrip('\x00') but it would be better if this is not necessary with a future release.

Thanks

sebastiandeorowicz commented 5 months ago

Is it possible to provide me the dataset for which this happens? We plan to release AGC v.3.1 today or just after the weekend. If possible we would fix this bug.

patrick-koenig commented 5 months ago

@sebastiandeorowicz Yes, with pleasure! Do you need the final archive file or the FASTA files? Just as a note: The FASTA files look absolutely normal.

sebastiandeorowicz commented 5 months ago

Fasta and command line you used for compression would be great. It would allow me to check if the problem is only related to library or to compression. If FASTAs are huge, let's start from the archive. Maybe it will be sufficient.

patrick-koenig commented 5 months ago

The used FASTA files can be downloaded here: https://galaxy-web.ipk-gatersleben.de/libraries/folders/F9ce617df390851fc/page/1

Command line was those from the examples in the README. Sorry, I did not record the command. But it was nothing special. More or less default parameters for agc create.

The problem appears only when using the Python bindings. If I use command line executable, everything is fine with the sequence extraction. I guess the bug is located in the source code for the Python binding.

sebastiandeorowicz commented 5 months ago

Probably you are right. I'm downloading and let you know after fixing the bug. I hope we will be able to include the fix in v.3.1.

sebastiandeorowicz commented 5 months ago

One more thing. Which sample is the reference?

patrick-koenig commented 5 months ago

The reference was the genotype "Morex" with this filename: Morex_v3_pseudomolecules_and_unplaced_contigs_ENA.fasta.gz

patrick-koenig commented 5 months ago

What I forgot to mention: The preceding repititions of \x00 are independet from the parameters of the GetCtgSeq() method. It doesn't matter if I set start to the first nucleotide (position = 0) or to some position near the end of the contig. It also doesn't matter which genotype or contig I choose.

OS is Rocky Linux 9 with Python 3.11 via miniconda. To make the bindings work with Python 3.11 I switched pybind11 to version 2.11.1. But the \x00 issue also appeared with Python 3.10 and pybind11-2.8.1 as it is in your repository here.

sebastiandeorowicz commented 5 months ago

You were right. The bug is related to the connection between the C++ and Python libraries. We will fix it in the forthcoming release.

sebastiandeorowicz commented 5 months ago

AGC 3.1 with the fix is released.