marcpaga / basecalling_architectures

The Unlicense
10 stars 5 forks source link

Missing of basecall_comb.py and Questions about the overlap parameter of the input channel #7

Closed sparkcyf closed 8 months ago

sparkcyf commented 8 months ago

Hi @marcpaga , thanks for your great work! I'm current using your code to implement some basecaller with mix network architecture. But found two issue about that:

https://github.com/marcpaga/basecalling_architectures/blob/f8686069c3021304aac03bf99bdb07849911ab00/src/classes.py#L1105-L1117

https://github.com/marcpaga/basecalling_architectures/blob/f8686069c3021304aac03bf99bdb07849911ab00/src/classes.py#L1275-L1277

Thanks in adavnce!

marcpaga commented 8 months ago

Hi @sparkcyf,

Missing Script in Repository: I noticed that the GitHub repository seems to be missing the ./scripts/benchmark folder, which should contain the basecall_comb.py script. This script is essential for the basecalling process of the mix architecture basecaller network. Could you please check and possibly update the repository with this script?

Sorry about this, there was a script called basecall_grid_analysis.py that does that. I have renamed it to basecall_comb.py as it is indeed a more appropiate name given he naming scheme. I have also updated its CLI interface. Please let me know if you run into any problems with this script.

Inconsistency in Overlap Parameters: I've observed an inconsistency in the overlap parameters between the input and output. Specifically, in basecalling_architectures/src/classes.py, the BaseFast5Dataset class is initialized with a window size of 2000 and an overlap of 400 at the input channel. However, the default overlap for the output, corresponding to the 2000 chunk size (window size), is set to 200. This discrepancy seems to cause repeated bases in the resulting fastq file. After aligning the overlap size in the BaseBasecaller class to 400, as in BaseFast5Dataset, the issue was resolved. Does the difference overlap size between the input and output channel an expected situation?

This is indeed an inconsistency in default parameters, it doesn't make much sense to have different parameters for this. I have changed the default to 200. Thanks for spotting this.

sparkcyf commented 8 months ago

Thank you for clarfying this! I'll try to use the same overlap parameter at BaseFast5Dataset and BaseBasecaller for my further basecalling workflow.