Reference sample selection could be made more robust

jmmitc06 commented 1 year ago

As discussed the other day, the selection of the reference sample before MassGrid construction could be made more reliable. Currently, as Minghao explained to me, the reference sample is selected at random. This should be fine in the majority of cases, but when a completely different sample is present in the experiment (such as a DDA sample included by mistake) and it is selected as the reference, all other samples will fail to align. I have not tested this, but I could imagine the same thing happening if an experiment contains a failed injection and that sample is selected as the reference.

Not sure what the alternative should be, but one idea is to select the sample whose TIC is closest to the median of all TICs. Alternatively, if we get too many failed-alignment messages, we could select another reference at random.
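To make the median-TIC idea concrete, here is a minimal sketch. It is not asari code: the function name `pick_reference_by_median_tic`, the dict of per-sample TIC values, and the example numbers are all hypothetical, and it assumes one summed TIC per sample has already been computed.

```python
from statistics import median


def pick_reference_by_median_tic(tic_by_sample):
    """Return the sample name whose TIC is closest to the median TIC.

    tic_by_sample: hypothetical dict mapping sample name -> total ion current.
    """
    if not tic_by_sample:
        raise ValueError("no samples given")
    med = median(tic_by_sample.values())
    # min() keeps the first minimum it sees, so dict order breaks ties.
    return min(tic_by_sample, key=lambda name: abs(tic_by_sample[name] - med))


# Made-up values: three ordinary samples, one failed injection, one DDA run.
tics = {
    "s1": 9.8e8,
    "s2": 1.0e9,
    "s3": 1.1e9,
    "failed_injection": 2.0e6,
    "dda": 4.5e9,
}
reference = pick_reference_by_median_tic(tics)
```

With these made-up values the failed injection and the DDA run sit at the extremes of the TIC distribution, so neither can be picked as the reference.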

shuzhao-li commented 1 year ago

The selection is done in ext_Experiment.get_reference_sample_id: either by user specification, or by using the sample with the highest number_anchor_mz_pairs, limited to the first 100 samples searched. This assumes that the sample with the most good m/z values has good feature coverage.

jmmitc06 commented 1 year ago

Thanks, after looking at the implementation and with your description, that makes sense.

However, if we want to limit the search to the first 100 samples, I do not see that logic in the get_reference_sample_id implementation. From my reading of the function, it searches all samples in sample_registry, which seems better than an arbitrary limit of 100 samples.
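For reference, my reading of the behavior can be sketched roughly as below. This is an illustration, not the actual get_reference_sample_id code: the function name `choose_reference_sample_id`, the `user_specified_id` parameter, and the assumed shape of sample_registry (a dict mapping sample ID to a dict holding number_anchor_mz_pairs) are all assumptions made for the example.

```python
def choose_reference_sample_id(sample_registry, user_specified_id=None):
    """Pick a reference sample: user choice first, else most anchor m/z pairs.

    sample_registry: assumed dict of sample ID -> dict containing
    'number_anchor_mz_pairs' (the layout is an assumption for illustration).
    """
    if user_specified_id is not None:
        if user_specified_id not in sample_registry:
            raise KeyError(f"unknown sample ID: {user_specified_id}")
        return user_specified_id
    if not sample_registry:
        raise ValueError("sample_registry is empty")
    # Search every registered sample; no cutoff at the first 100.
    return max(
        sample_registry,
        key=lambda sid: sample_registry[sid]["number_anchor_mz_pairs"],
    )


# Made-up registry with three samples.
registry = {
    1: {"number_anchor_mz_pairs": 120},
    2: {"number_anchor_mz_pairs": 340},
    3: {"number_anchor_mz_pairs": 15},
}
auto_choice = choose_reference_sample_id(registry)
user_choice = choose_reference_sample_id(registry, user_specified_id=3)
```

Here a sample with very few anchor pairs (such as a failed injection) would only become the reference if the user asked for it explicitly.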

Once the above issue of limiting the search is clarified, I can close this issue.

shuzhao-li commented 1 year ago

Good catch! That docstring was left over from earlier versions; it is now fixed.

jmmitc06 commented 1 year ago

Closing the issue since the original concern was not legitimate (i.e., the reference sample is not selected at random) and the issue with the docstring has been resolved.

shuzhao-li-lab / asari

Reference sample selection could be made more robust #36