hsosik closed 10 months ago
A reasonable starting place for criteria and tolerance:

    if abs(ml_analyzed_hdr - ml_analyzed_adc) < 0.05
        ml_analyzed = ml_analyzed_hdr;
    else
        ml_analyzed = ml_analyzed_adc;
    end

Among 1074 bins in the SG2105 dataset, here are the 19 bins that have bad ml_analyzed_hdr according to the above criteria:

    pid                             ml_analyzed_hdr    ml_analyzed_adc
    ____________________________    _______________    _______________
    {'D20210503T171928_IFCB115'}    0.88862            0.83318
    {'D20210507T225916_IFCB102'}    32.456             3.3309
    {'D20210511T220157_IFCB102'}    13.123             2.2706
    {'D20210512T155857_IFCB102'}    0.024627           2.9072
    {'D20210513T003630_IFCB102'}    0.024627           2.0426
    {'D20210513T113350_IFCB102'}    0.024627           2.6736
    {'D20210513T234948_IFCB102'}    6.5381             2.6978
    {'D20210514T090858_IFCB102'}    0.024627           2.5023
    {'D20210514T172134_IFCB102'}    2.7174             2.1922
    {'D20210515T163434_IFCB102'}    10.81              3.0479
    {'D20210515T232812_IFCB102'}    4.987              2.9775
    {'D20210518T014048_IFCB102'}    10.809             2.6808
    {'D20210518T081012_IFCB102'}    4.9725             2.6879
    {'D20210518T205045_IFCB102'}    0.49873            2.6363
    {'D20210519T152904_IFCB102'}    4.9872             2.1729
    {'D20210519T161741_IFCB102'}    0.00049828         2.3269
    {'D20210520T025630_IFCB102'}    32.687             2.7402
    {'D20210520T135944_IFCB102'}    4.9873             2.8065
    {'D20210520T182732_IFCB102'}    6.397              1.4393

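For anyone porting the criterion above to Python, here is a minimal sketch. The function name `select_ml_analyzed` is hypothetical and not taken from the existing code; the first three rows are copied from the table above, and the last row is a made-up example of a bin where the two estimates agree.

```python
TOLERANCE_ML = 0.05  # tolerance from the criterion above


def select_ml_analyzed(ml_hdr, ml_adc, tol=TOLERANCE_ML):
    """Use the hdr value when it agrees with the adc value within tol, else the adc value."""
    return ml_hdr if abs(ml_hdr - ml_adc) < tol else ml_adc


rows = [
    # (pid, ml_analyzed_hdr, ml_analyzed_adc) copied from the table above
    ("D20210503T171928_IFCB115", 0.88862, 0.83318),
    ("D20210507T225916_IFCB102", 32.456, 3.3309),
    ("D20210519T161741_IFCB102", 0.00049828, 2.3269),
    # hypothetical bin where hdr and adc agree, so the hdr value is kept
    ("hypothetical_good_bin", 2.85, 2.86),
]

for pid, ml_hdr, ml_adc in rows:
    chosen = select_ml_analyzed(ml_hdr, ml_adc)
    source = "hdr" if chosen == ml_hdr else "adc"
    print(f"{pid}: ml_analyzed = {chosen} (from {source})")
```

The first row shows how tight the tolerance is: the two estimates differ by only about 0.055 ml, which is still enough to fall back to the adc value.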
@hsosik why not simply always use the value computed from the ADC data?
The hdr value is more accurate when it's not wrong, and it is the better choice most of the time.
Reopening this to report on recent work and close out the changes
I don't see recent activity, so maybe this is not the correct place to post this--let me know if there's another more active issue. I see something unexpected in the results now in the IFCB dashboard database for at least one case in 2023.
For this bin: https://ifcb-data.whoi.edu/bin?bin=D20230727T025611_IFCB127 the volume analyzed is now reported as: Volume Analyzed: 4.988 ml
My matlab result is as follows:

    >> IFCB_volume_analyzed('https://ifcb-data.whoi.edu/mvco/D20230727T025611_IFCB127.hdr')
    ans = 2.8529

It appears that the current python code must be using two bad lines at the end of the adc file that have inhibittime reported at 0, which is incorrect.
This difference in result from the matlab and python code should show up in more bins if we do a more systematic comparison between the two, which I think should be done to make sure there are not other inconsistencies.
Here is another example from the EXPORTS data set that is also not working in the python implementation: https://ifcb-data.whoi.edu/bin?bin=D20210501T163341_IFCB125
IFCB dashboard shows: Volume Analyzed: 4.978 ml
Matlab result is:

    >> IFCB_volume_analyzed('https://ifcb-data.whoi.edu/EXPORTS/D20210501T163341_IFCB125.hdr')
    ans = 1.748566335416668

This is a case with many 0 values in the last two time columns of the adc file. It is supposed to be handled by a case in the code that uses only the non-zero time rows, along with a mode value of the good inhibit times for the 0 rows:

    %second best estimate, last good row, plus mode as best guess for each bad row
    inhibittime(count) = adc.Var24(iii(end)) + (size(adc,1)-length(iii)) * modeinhibittime-inhibittime_offset;

Is the python code missing these cases or did the wrong code get implemented for the new updates to the dashboard database?
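To make the fallback case quoted above concrete, here is a rough Python sketch of the same arithmetic. It is not the actual dashboard code: the function name `fallback_inhibit_time` is hypothetical, and it assumes that `modeinhibittime` is the most common row-to-row increment among the good (non-zero) inhibit times. Rounding the increments before taking the mode is also my own choice, made so that floating-point noise does not split otherwise equal values.

```python
from statistics import mode


def fallback_inhibit_time(inhibit_times, inhibittime_offset=0.0):
    """Estimate the final inhibit time when some rows report 0.

    Mirrors the quoted MATLAB line: last good value, plus one modal
    increment per bad row, minus the offset.
    """
    good = [t for t in inhibit_times if t != 0]  # rows with usable times
    if len(good) < 2:
        raise ValueError("need at least two non-zero inhibit times")
    n_bad = len(inhibit_times) - len(good)  # size(adc,1) - length(iii)
    # Assumption: modeinhibittime is the mode of consecutive differences
    increments = [round(b - a, 6) for a, b in zip(good, good[1:])]
    mode_increment = mode(increments)
    return good[-1] + n_bad * mode_increment - inhibittime_offset


# Made-up column with two bad (zero) rows at the end, as in the bins above
times = [0.5, 1.0, 1.5, 0.0, 0.0]
print(fallback_inhibit_time(times))  # 2.5
```

Reading the last row directly would give an inhibit time of 0 for this made-up column, which is the kind of error that inflates the reported volume analyzed.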
Your diagnosis is correct and adding the case brings the Python and MATLAB code into agreement for these bins. PR is #76
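For the more systematic comparison suggested earlier in the thread, a check along these lines might be a starting point. It is only a sketch: `find_mismatches` is a hypothetical helper, the 0.05 ml tolerance is borrowed from the hdr/adc criterion rather than agreed for this purpose, and the values are the two disagreeing bins reported above, as they stood before the fix.

```python
def find_mismatches(matlab_ml, python_ml, tol=0.05):
    """Return (pid, matlab_value, python_value) for bins whose results differ by more than tol."""
    mismatches = []
    # Only compare bins present in both result sets
    for pid in sorted(matlab_ml.keys() & python_ml.keys()):
        if abs(matlab_ml[pid] - python_ml[pid]) > tol:
            mismatches.append((pid, matlab_ml[pid], python_ml[pid]))
    return mismatches


# Values reported in this thread (dashboard values are from before the fix)
matlab_results = {
    "D20230727T025611_IFCB127": 2.8529,
    "D20210501T163341_IFCB125": 1.748566335416668,
}
dashboard_results = {
    "D20230727T025611_IFCB127": 4.988,
    "D20210501T163341_IFCB125": 4.978,
}

for pid, matlab_value, python_value in find_mismatches(matlab_results, dashboard_results):
    print(f"{pid}: matlab={matlab_value} python={python_value}")
```

Running the same check over a whole dataset after the fix should come back empty if the two implementations really agree.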
Some cases have obviously bad info in the hdr file for ml_analyzed estimates; other cases are less obvious but still bad. A brute-force approach is to compute ml_analyzed both ways, compare the results, and select the adc-based value whenever the difference is outside some tolerance. This presumes the adc value is more likely to be correct (which seems to be true from my inspection of results).