iobio / gene.iobio

Gene.iobio vue
MIT License

The SAC genotype field is listed in the "format" vcf field, but it is not available for selection in the dialog #1029

Closed tonydisera closed 8 months ago

tonydisera commented 8 months ago

For the demo exome vcf dataset, there are 8 format fields listed in the header, but only 6 genotype values that can actually be parsed.

Here is the header:

Screenshot 2023-10-22 at 6 59 53 PM

But not all of these fields in the header are actually included in the genotype. Here is the format column, showing 6 fields:

Screenshot 2023-10-22 at 6 57 48 PM

In the Select Annotations dialog, 7 fields are listed for selection. Notice that SAC is not included, and that 2 of the listed fields (PGT, PID) are not present in the format (and genotype) columns.

Screenshot 2023-10-22 at 7 00 17 PM

To fix, you should filter vcf.infoFields.FORMAT so that it only includes those format fields (from the header) that actually appear in the format column.
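The filtering suggested above could look roughly like the sketch below. The shape of vcf.infoFields.FORMAT is not shown in this thread, so an array of objects with an `id` property is assumed here, and `filterFormatFields` is a hypothetical helper name. The field ids and the format column value are illustrative, not copied from the demo dataset.

```javascript
// Hypothetical helper: keep only the header FORMAT fields that occur in at
// least one record's FORMAT column (e.g. "GT:AD:DP").
// Assumes each header field is an object with an `id` property.
function filterFormatFields(headerFormatFields, formatColumns) {
  const used = new Set();
  for (const column of formatColumns) {
    for (const key of column.split(':')) {
      used.add(key);
    }
  }
  // filter() preserves the header order of the fields that remain.
  return headerFormatFields.filter((field) => used.has(field.id));
}

// Illustrative header with 8 FORMAT fields and one record that uses 6 of them.
const headerFields = [
  { id: 'GT' }, { id: 'AD' }, { id: 'DP' }, { id: 'GQ' },
  { id: 'PGT' }, { id: 'PID' }, { id: 'PL' }, { id: 'SAC' },
];
const formatColumns = ['GT:AD:DP:GQ:PL:SAC'];

// Logs the ids GT, AD, DP, GQ, PL, SAC (PGT and PID are dropped).
console.log(filterFormatFields(headerFields, formatColumns).map((f) => f.id));
```

Because the helper takes the union over all format columns it is given, a field is only dropped if no scanned record uses it.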

As for the SAC field not being included, you will need to troubleshoot the code in vcf.iobio.js that creates vcf.infoFields.FORMAT. I suspect that an error is occurring when parsing the vcf header FORMAT record for the SAC field. I noticed some suspicious code in _parseHeaderForInfoORFormat. The field infoOrFormat is never initialized, so what happens when the regular expression does not match? I'd try printing some console messages when matches == null to see whether SAC is failing to parse. If that is the problem, then something must be wrong with the regular expression.

tonydisera commented 8 months ago

This is also a bug that needs to be fixed before 4.9 can be released.

tonydisera commented 8 months ago

Please make sure you create a branch off of the latest 4.9 branch as much of the code has changed.

YangQi007 commented 8 months ago

@tonydisera I checked the regular expression. It turns out that Number may be a dot character ("."), and SAC is such a case. It is now fixed.

Screenshot 2023-10-22 at 10 21 52 PM
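The regular expression in vcf.iobio.js is not quoted in this thread, so the sketch below only illustrates the kind of fix described: a pattern that accepts "." for Number. Both patterns and the SAC header line are assumptions made for the example. In a VCF header, Number can be an integer, one of A, R, or G, or "." when the count is unknown or unbounded.

```javascript
// Accepts an integer, A/R/G, or "." for Number. Inside a character class
// the dot is a literal dot, so no escaping is needed.
const HEADER_FIELD_PATTERN =
  /^##(INFO|FORMAT)=<ID=([^,]+),Number=(\d+|[ARG.]),Type=(\w+),Description="([^"]*)"/;

// A narrower pattern, shown only for contrast (hypothetical, not the
// original code): \w+ does not match ".", so "Number=." fails.
const NARROW_PATTERN =
  /^##(INFO|FORMAT)=<ID=([^,]+),Number=(\w+),Type=(\w+),Description="([^"]*)"/;

// Illustrative header record for SAC.
const sacLine =
  '##FORMAT=<ID=SAC,Number=.,Type=Integer,Description="Reads on each strand supporting each allele">';

console.log(NARROW_PATTERN.test(sacLine));          // false
console.log(HEADER_FIELD_PATTERN.exec(sacLine)[3]); // "."
```

This simple pattern does not handle escaped quotes inside Description, which a fuller parser would need to consider.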

As for the genotype values not matching the format header that you mentioned: for some variants, including the one in your example, only 6 values are present. But other variants have all 8. For example:

Screenshot 2023-10-22 at 10 24 50 PM

Do you think it makes sense to keep displaying all of the format header fields in selectVariantAnnotationDialog, but display the value 'None' for any field that is missing from the genotype column of a particular variant?
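A minimal sketch of that proposal, assuming the dialog works from a list of header field ids plus each variant's format and sample columns. `genotypeValuesFor` is a hypothetical name and the sample values are made up for illustration.

```javascript
// Hypothetical helper: return a value for every header FORMAT id, using
// 'None' when the variant's FORMAT column does not contain that id.
function genotypeValuesFor(headerIds, formatColumn, sampleColumn) {
  const keys = formatColumn.split(':');
  const values = sampleColumn.split(':');
  const present = new Map();
  keys.forEach((key, i) => {
    // A sample column can be shorter than the FORMAT column; keys without
    // a corresponding value are treated as missing.
    if (i < values.length) {
      present.set(key, values[i]);
    }
  });
  const result = {};
  for (const id of headerIds) {
    result[id] = present.has(id) ? present.get(id) : 'None';
  }
  return result;
}

// Illustrative variant whose FORMAT column lacks PGT and PID.
const allHeaderIds = ['GT', 'AD', 'DP', 'GQ', 'PGT', 'PID', 'PL', 'SAC'];
const genotype = genotypeValuesFor(
  allHeaderIds,
  'GT:AD:DP:GQ:PL:SAC',
  '0/1:12,9:21:99:255,0,310:7,5,4,5'
);
console.log(genotype.PGT, genotype.PID, genotype.SAC);
```

With this approach the dialog can always offer every header field, and the per-variant display makes it explicit which values are absent.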

AlistairNWard commented 8 months ago

This makes sense to me. We can’t guarantee that the header is accurate. Someone could have just added the header from a different vcf file, so it might have nothing to do with the actual data. We can’t validate this though, so we just assume the header is valid and show all options. So long as we’re clear that data is missing for a specific variant, this is fine.

tonydisera commented 8 months ago

If INFO or FORMAT header rec cannot be parsed, print message to console about field being bypassed. Fix bug that was returning previous record when match failed.
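The commit itself is not shown here, so the following is only a sketch of the behavior that message describes. Each call builds a fresh result, so a failed match cannot hand back the previous record, and unparseable records are reported on the console and skipped. The function names and the pattern are assumptions, not the code in vcf.iobio.js.

```javascript
// Assumed pattern for INFO/FORMAT header records (illustrative only).
const INFO_OR_FORMAT_PATTERN =
  /^##(INFO|FORMAT)=<ID=([^,]+),Number=([^,]+),Type=([^,]+),Description="([^"]*)"/;

// Parse one header line. The result is created inside the function, so no
// state from an earlier call can leak into this one.
function parseHeaderRecord(line) {
  const matches = INFO_OR_FORMAT_PATTERN.exec(line);
  if (matches === null) {
    console.warn(`Bypassing header record that could not be parsed: ${line}`);
    return null;
  }
  const [, kind, id, number, type, description] = matches;
  return { kind, id, number, type, description };
}

// Collect parsed records by kind, skipping any that failed to parse.
function parseHeaderFields(lines) {
  const fields = { INFO: [], FORMAT: [] };
  for (const line of lines) {
    if (!/^##(INFO|FORMAT)=/.test(line)) continue;
    const record = parseHeaderRecord(line);
    if (record !== null) {
      fields[record.kind].push(record);
    }
  }
  return fields;
}
```

For example, given a valid DP record, a malformed FORMAT record, and a valid SAC record, `parseHeaderFields` would return only DP and SAC under FORMAT, rather than repeating DP for the record that failed.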