joefutrelle / pyifcb

IFCB data system, generation 2
MIT License

ml_analyzed should be computed from hdr and adc info to select the best estimate #70

Closed hsosik closed 10 months ago

hsosik commented 2 years ago

Some cases have obviously bad info in the hdr file for the ml_analyzed estimate; in other cases the problem is less obvious, but the value is still bad. A brute-force approach is to compute ml_analyzed both ways, compare the results, and select the ADC-based value whenever the difference is outside some tolerance. This presumes the ADC value is more likely to be correct, which seems to be true from my inspection of results.

hsosik commented 2 years ago

A reasonable starting place for criteria and tolerance:

    if abs(ml_analyzed_hdr - ml_analyzed_adc) < 0.05
        ml_analyzed = ml_analyzed_hdr;
    else
        ml_analyzed = ml_analyzed_adc;
    end
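For reference, the same selection rule can be sketched in Python. This is only an illustration of the criterion proposed above, not code from pyifcb: the function name `select_ml_analyzed` and its `tolerance` parameter are hypothetical, and the default of 0.05 is simply the starting value suggested in this comment.

```python
def select_ml_analyzed(ml_analyzed_hdr, ml_analyzed_adc, tolerance=0.05):
    """Pick between the hdr-based and ADC-based volume estimates (in ml).

    Hypothetical helper, not part of pyifcb. The hdr value is preferred
    when the two estimates agree to within `tolerance`; otherwise the
    ADC value is assumed to be the more trustworthy one.
    """
    if abs(ml_analyzed_hdr - ml_analyzed_adc) < tolerance:
        return ml_analyzed_hdr
    return ml_analyzed_adc


# Values for bin D20210507T225916_IFCB102 from the SG2105 table in this thread:
# the hdr estimate is far from the ADC estimate, so the ADC value is selected.
chosen = select_ml_analyzed(32.456, 3.3309)
```

Keeping the tolerance as a parameter makes it easy to test other thresholds against a dataset before settling on one.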

hsosik commented 2 years ago

Among 1074 bins in the SG2105 dataset, here are the 19 bins that have bad ml_analyzed_hdr according to the above criteria:

                pid                 ml_analyzed_hdr    ml_analyzed_adc
____________________________    _______________    _______________

{'D20210503T171928_IFCB115'}         0.88862           0.83318    
{'D20210507T225916_IFCB102'}          32.456            3.3309    
{'D20210511T220157_IFCB102'}          13.123            2.2706    
{'D20210512T155857_IFCB102'}        0.024627            2.9072    
{'D20210513T003630_IFCB102'}        0.024627            2.0426    
{'D20210513T113350_IFCB102'}        0.024627            2.6736    
{'D20210513T234948_IFCB102'}          6.5381            2.6978    
{'D20210514T090858_IFCB102'}        0.024627            2.5023    
{'D20210514T172134_IFCB102'}          2.7174            2.1922    
{'D20210515T163434_IFCB102'}           10.81            3.0479    
{'D20210515T232812_IFCB102'}           4.987            2.9775    
{'D20210518T014048_IFCB102'}          10.809            2.6808    
{'D20210518T081012_IFCB102'}          4.9725            2.6879    
{'D20210518T205045_IFCB102'}         0.49873            2.6363    
{'D20210519T152904_IFCB102'}          4.9872            2.1729    
{'D20210519T161741_IFCB102'}      0.00049828            2.3269    
{'D20210520T025630_IFCB102'}          32.687            2.7402    
{'D20210520T135944_IFCB102'}          4.9873            2.8065    
{'D20210520T182732_IFCB102'}           6.397            1.4393    

joefutrelle commented 2 years ago

@hsosik why not simply always use the value computed from the ADC data?

hsosik commented 2 years ago

The hdr value is more accurate when it's not wrong, and it is the better estimate most of the time.

joefutrelle commented 1 year ago

Reopening this to report on recent work and close out the changes.

hsosik commented 1 year ago

I don't see recent activity, so maybe this is not the correct place to post this; let me know if there's another, more active issue. I now see something unexpected in the results in the IFCB dashboard database for at least one case in 2023.

For the bin https://ifcb-data.whoi.edu/bin?bin=D20230727T025611_IFCB127 the dashboard now reports: Volume Analyzed: 4.988 ml

My MATLAB result is as follows:

    >> IFCB_volume_analyzed('https://ifcb-data.whoi.edu/mvco/D20230727T025611_IFCB127.hdr')
    ans = 2.8529

It appears that the current Python code is using two bad lines at the end of the adc file that report inhibittime as 0, which is incorrect.

This difference between the MATLAB and Python results should show up in more bins if we do a more systematic comparison of the two, which I think should be done to make sure there are no other inconsistencies.
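A systematic comparison of this kind could be sketched roughly as follows. This assumes the MATLAB and Python results have already been collected into two dictionaries keyed by bin pid; the function name `find_disagreements` is hypothetical, and the 0.05 ml tolerance is borrowed from the hdr/ADC criterion earlier in this thread rather than being an agreed threshold for this comparison.

```python
def find_disagreements(matlab_ml, python_ml, tolerance=0.05):
    """List bins whose MATLAB and Python ml_analyzed values disagree.

    Hypothetical helper, not part of pyifcb. Both arguments map a bin pid
    to its ml_analyzed value; only pids present in both are compared.
    Returns (pid, matlab_value, python_value) tuples sorted by pid.
    """
    shared_pids = sorted(set(matlab_ml) & set(python_ml))
    return [
        (pid, matlab_ml[pid], python_ml[pid])
        for pid in shared_pids
        if abs(matlab_ml[pid] - python_ml[pid]) >= tolerance
    ]


# The bin reported in this comment: MATLAB result vs. dashboard value.
matlab_results = {"D20230727T025611_IFCB127": 2.8529}
python_results = {"D20230727T025611_IFCB127": 4.988}

for pid, matlab_value, python_value in find_disagreements(matlab_results, python_results):
    print(f"{pid}: MATLAB={matlab_value} Python={python_value}")
```

Restricting the comparison to pids present in both inputs avoids flagging bins that were only processed by one implementation; those would need to be checked separately.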

hsosik commented 1 year ago

Here is another example from the EXPORTS data set that is also not working in the Python implementation: https://ifcb-data.whoi.edu/bin?bin=D20210501T163341_IFCB125

IFCB dashboard shows: Volume Analyzed: 4.978 ml

MATLAB result is:

    >> IFCB_volume_analyzed('https://ifcb-data.whoi.edu/EXPORTS/D20210501T163341_IFCB125.hdr')
    ans = 1.748566335416668

This is a case with many 0 values in the last two time columns of the adc file. It is supposed to be handled by a case in the code that uses only the non-zero time rows, along with the mode of the good inhibit times as the estimate for each 0 row:

    %second best estimate, last good row, plus mode as best guess for each bad row
    inhibittime(count) = adc.Var24(iii(end)) + (size(adc,1)-length(iii)) * modeinhibittime-inhibittime_offset;

Is the Python code missing these cases, or did the wrong code get implemented for the new updates to the dashboard database?
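To make the MATLAB fallback quoted above concrete, here is a minimal Python sketch of the same arithmetic. It is not the pyifcb implementation. The function name and arguments are hypothetical, the input is simplified to a single list of cumulative inhibit times with 0 marking a bad row, and the way the mode is estimated (most common step between consecutive good rows) is an assumption, since the MATLAB definition of `modeinhibittime` is not shown in this thread.

```python
from statistics import mode


def fallback_inhibit_time(inhibit_times, inhibit_time_offset=0.0, mode_inhibit_time=None):
    """Estimate total inhibit time when some ADC rows report 0.

    Hypothetical helper mirroring the quoted MATLAB line:
    last good value + (number of bad rows) * mode - offset.
    """
    good = [t for t in inhibit_times if t != 0]
    if not good:
        raise ValueError("no rows with a non-zero inhibit time")
    n_bad = len(inhibit_times) - len(good)
    if mode_inhibit_time is None:
        # ASSUMPTION: use the most common increment between consecutive
        # good rows; rounding keeps float noise from splitting equal steps.
        steps = [round(b - a, 6) for a, b in zip(good, good[1:])]
        mode_inhibit_time = mode(steps) if steps else 0.0
    return good[-1] + n_bad * mode_inhibit_time - inhibit_time_offset


# Made-up illustrative values: three good rows followed by two bad (zero) rows.
estimate = fallback_inhibit_time([0.1, 0.2, 0.3, 0.0, 0.0])
```

Passing `mode_inhibit_time` explicitly allows the sketch to be checked against a value computed by the MATLAB code for the same bin.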

joefutrelle commented 11 months ago

Your diagnosis is correct, and adding the case brings the Python and MATLAB code into agreement for these bins. The PR is #76.