phuse-org / BioCelerate

Search scripts intended to be used to query and collate information from SEND datasets which can then be utilized by the searcher for their own cross-study analysis.
MIT License

Handling uncertain matched data in special cases #16

Closed bolrDK closed 2 years ago

bolrDK commented 3 years ago

Most of the getXXXX functions take (among others) these two input parameters:

• inclUncertain: This is only relevant when a function is called with one or more filtering criteria, e.g. a list of species used to filter the set of animals to be returned.
If inclUncertain=TRUE, uncertain rows (i.e. rows where the species cannot be confidently identified) are included in the result set, with the explanation given in column UNCERTAIN_MSG. If an input set of e.g. animals is passed in the function call, the generated UNCERTAIN_MSG values are concatenated to any existing UNCERTAIN_MSG values in the corresponding rows of the input data set.
If inclUncertain=FALSE, only rows that can be confidently matched to the input filtering criteria are included, and no UNCERTAIN_MSG column is added. But for this case I have realised that I haven't considered the scenario where the input data set itself includes uncertain records, i.e. it contains an UNCERTAIN_MSG column which may hold some non-empty values. What is the correct way to handle these data?

  1. Include the uncertain rows from the input data in the filtering executed by the current function. Those uncertain input rows that confidently match the filter in the current function are included in the output set with their original UNCERTAIN_MSG kept.
  2. Exclude the uncertain rows from the input data set before executing the filtering in the current function.
  3. Do not execute the current function; report an error stating that if the input data contains uncertain data, the function must be executed with inclUncertain=TRUE.
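
To make the three options concrete, here is a minimal Python sketch (not the actual getXXXX code). It assumes rows are plain dictionaries keyed by column name; the function name `filter_confident`, the `input_policy` argument and the sample rows are all hypothetical and only serve to illustrate the behaviour when inclUncertain=FALSE.

```python
def has_uncertain_msg(row):
    """True if the input row already carries a non-empty UNCERTAIN_MSG."""
    return bool(row.get("UNCERTAIN_MSG"))


def filter_confident(rows, species, input_policy):
    """Mimic a species filter called with inclUncertain=FALSE.

    input_policy (hypothetical argument) selects the option under discussion:
      "keep"  -> option 1: uncertain input rows take part in the filtering
      "drop"  -> option 2: uncertain input rows are removed first
      "error" -> option 3: refuse to run if the input holds uncertain rows
    """
    if input_policy not in ("keep", "drop", "error"):
        raise ValueError("unknown input_policy: %r" % (input_policy,))
    if input_policy == "error" and any(has_uncertain_msg(r) for r in rows):
        raise ValueError(
            "input contains uncertain rows; rerun with inclUncertain=TRUE")
    if input_policy == "drop":
        rows = [r for r in rows if not has_uncertain_msg(r)]
    # Only confident matches are returned. Rows are copied unchanged, so an
    # UNCERTAIN_MSG that came in with the input is kept as it was.
    return [dict(r) for r in rows if r.get("SPECIES") in species]


# Hypothetical input produced by an earlier call made with inclUncertain=TRUE.
rows = [
    {"USUBJID": "A1", "SPECIES": "RAT", "UNCERTAIN_MSG": ""},
    {"USUBJID": "A2", "SPECIES": "RAT", "UNCERTAIN_MSG": "SEX: missing"},
    {"USUBJID": "A3", "SPECIES": "DOG", "UNCERTAIN_MSG": ""},
]
option_1 = filter_confident(rows, {"RAT"}, "keep")
option_2 = filter_confident(rows, {"RAT"}, "drop")
```

In this sketch option 1 returns both rats (the second with its original message), whereas option 2 silently returns only the first.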

• noFilterReportUncertain: This is only relevant when a function is called without any filtering criteria, i.e. when the purpose is to add values for a given parameter/column, e.g. study start date or sex of animal.
If noFilterReportUncertain=TRUE, a column NOT_VALID_MSG is added to the output data set, and if the actual parameter/column value cannot be confidently identified, the reason is reported in NOT_VALID_MSG. If the input data set (when one has been given in the function call) includes a NOT_VALID_MSG column, the generated NOT_VALID_MSG values are concatenated to any existing NOT_VALID_MSG values in the corresponding rows of the input data set.
If noFilterReportUncertain=FALSE, no reason is reported for rows where the actual parameter/column value cannot be confidently identified, and no NOT_VALID_MSG column is added. But again, as in the case described above for the handling of uncertainties, I have realised that I haven't considered the scenario where the input data set already includes a NOT_VALID_MSG column. What is the correct way to handle these data?

  1. Keep the NOT_VALID_MSG column, with the existing values from the input data set, in the output data set.
  2. Do not include the NOT_VALID_MSG column from the input data set in the output data set.
  3. Do not execute the current function; report an error stating that if the input data contains NOT_VALID_MSG, the function must be executed with noFilterReportUncertain=TRUE.

Here too I'm in doubt about the right solution, so please let me hear your opinion on this as well.
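
The NOT_VALID_MSG case can be illustrated in the same way. The sketch below is plain Python, not the package code: `add_sex`, the `keep_existing_msg` switch, the "|" separator and the message text are assumptions made for the example. It shows the concatenation when noFilterReportUncertain=TRUE, and options 1 and 2 when it is FALSE.

```python
def add_sex(rows, sex_by_animal, report_uncertain, keep_existing_msg=True):
    """Add a SEX column to each row (no filtering is applied).

    report_uncertain  ~ noFilterReportUncertain
    keep_existing_msg ~ option 1 (True) versus option 2 (False); it only
                        matters when report_uncertain is False.
    """
    out = []
    for row in rows:
        new = dict(row)
        sex = sex_by_animal.get(row["USUBJID"])
        new["SEX"] = sex
        if report_uncertain:
            # Assumed message text; appended to any message already present.
            reason = None if sex in ("M", "F") else "SEX: missing or invalid"
            parts = [m for m in (row.get("NOT_VALID_MSG"), reason) if m]
            new["NOT_VALID_MSG"] = "|".join(parts) if parts else None
        elif not keep_existing_msg:
            # Option 2: the column from the input is not carried over.
            new.pop("NOT_VALID_MSG", None)
        out.append(new)
    return out


# Hypothetical input: A1 already carries a message from an earlier call.
rows = [
    {"USUBJID": "A1", "NOT_VALID_MSG": "STSTDTC: invalid date"},
    {"USUBJID": "A2"},
]
sex_by_animal = {"A2": "F"}  # A1 has no usable sex value

reported = add_sex(rows, sex_by_animal, report_uncertain=True)
option_1 = add_sex(rows, sex_by_animal, report_uncertain=False)
option_2 = add_sex(rows, sex_by_animal, report_uncertain=False,
                   keep_existing_msg=False)
```

With option 1 the message that arrived with A1 survives the call; with option 2 it is dropped even though the current function never produced it.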

A solution is needed for these two open questions....

bolrDK commented 3 years ago

Answer from Kevin: My thought is that option 2 would be best in both cases, i.e. exclude the uncertain rows from the input data. I think that in practice it would not be good to mix “certain” queries from some getXXXX functions with “uncertain” queries from other getXXXX functions, i.e. ideally inclUncertain and noFilterReportUncertain would be set to the same value for all getXXXX functions that are strung together for an analysis. Option 2 would practically ensure that this is the case, as the user would need to set these parameters to TRUE in all getXXXX functions; otherwise, if any are set to FALSE, the output would be as if all were set to FALSE.

Answer from Bill: I think option 1 would be the most complete solution; however, I can imagine this is more complicated to program and test. I agree that this extra effort wouldn’t be necessary because a series of confident searches can be followed by a series of “uncertain” queries. After reviewing the results, a new query can be constructed that includes the desired records from the uncertain results in a new “confident-only” query.

I’m worried that option 2 could lead programmers to wonder why their uncertain results vanished. If we added a new function to strip uncertain results, we could effectively give programmers option 2 but without the mysterious disappearance of results.
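
Such a stripping function might look roughly like the following Python sketch. The name `strip_uncertain`, its arguments and the returned count are hypothetical, and rows are again assumed to be dictionaries keyed by column name. The point is that removal becomes an explicit step the programmer asks for, and the number of removed rows is reported back.

```python
def strip_uncertain(rows, msg_column="UNCERTAIN_MSG"):
    """Drop rows with a non-empty message and remove the message column.

    Returns (kept_rows, n_removed) so the caller can see how many rows
    were taken out instead of having them vanish silently.
    """
    kept = [
        {k: v for k, v in row.items() if k != msg_column}
        for row in rows
        if not row.get(msg_column)  # None and "" both count as "certain"
    ]
    return kept, len(rows) - len(kept)


# Hypothetical result of an earlier call made with inclUncertain=TRUE.
rows = [
    {"USUBJID": "A1", "UNCERTAIN_MSG": None},
    {"USUBJID": "A2", "UNCERTAIN_MSG": "SPECIES: not found"},
]
confident_rows, n_removed = strip_uncertain(rows)
```

A programmer who wants the option 2 behaviour would call this between two getXXXX steps; everyone else keeps their uncertain rows.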

bolrDK commented 3 years ago

It has been decided to go for option 1 to ensure that no data are excluded unexpectedly. All the get..... functions must be updated to handle this behaviour correctly.