pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
774 stars 274 forks source link

Fix error text when array tag typecode is invalid #1235

Open marcus1487 opened 11 months ago

marcus1487 commented 11 months ago

In current master when, this line is hit it raises an error reporting: AttributeError: module 'array' has no attribute 'typecode'. This is due to an apparent bug in the formatting of the intended error text. In order to report the valid error text this should be resolved. I believe the intent here is to report the typecode ordinal value (with the character provided earlier), but the intent may have been to report the value.typecode in this error text. I can revert that in this PR if that is indeed the intent.

bioinformed commented 11 months ago

Thanks! I think your fix is the appropriate one.

marcus1487 commented 11 months ago

Digging a bit deeper I think this is obfuscating a deeper bug in this function. It seems that when the valuetype is passed for an array type (B) here, that the typecode variable is now overloaded to indicate the type of the tag (int, string, array) as well as the type of the underlying array in this block. I think a second variable should be introduced for the array type so that the typecode can be reserved for the tag type and not the "sub-type" of the array. I can open a new PR for this or add this fix in here if you prefer.

marcus1487 commented 11 months ago

I am relatively certain that this is a bug now. It looks like the overloading of the typecode variable is intentional from the definition of DATATYPE2FORMAT so it seems that the intended behavior for this check is to ensure either the valuetype was not set (typecode == 0) or that it was set to array type (typecode == ord("B")). An error should be raised in the case that a value of type array.array was passed with a valuetype other than B. Then the array type should be determined. Happy to add this onto this PR to keep things simple.

jmarshall commented 11 months ago

I have been looking into this in the context of #1233. The correct fix to the error message is to change that to value.typecode and I have further improvements to the error message also drafted.

The function is unnecessarily incomprehensible because “typecode” is used to mean two different things: the Python array type code and the SAM/BAM aux type code. So the variables representing the SAM thing need to be renamed to auxtype or so.

Your last comment has been misled by this: B here means unsigned char (i.e. C in SAM-speak), not B-aux-array. So restricting the types of Python array passed in would not be correct.

marcus1487 commented 11 months ago

Your last comment has been misled by this: B here means unsigned char (i.e. C in SAM-speak), not B-aux-array. So restricting the types of Python array passed in would not be correct.

I think we are saying the same thing here (made confusing by the overloaded meaning put on the typecode variable). I was saying that since the typecode is overloaded when it enters this block it holds the value b'B' representing the aux-array type of the tag. Then after the check the meaning of the value is updated to the type of the aux array, but when the b'B' value is passed in this update to the variable does not happen.

In any case if you are working on this method and it will be fixed in the next release I will leave it to you. If you'd like for me to have a go at it I'd be more than happy to do so.

I guess google has not scraped issue #1233 as I did not get this hit when starting to look into this issue.