Closed yaseminbridges closed 4 months ago
@julesjacobsen can you double-check that what I have done here makes sense? This is what I ended up doing for the AI-MARRVEL benchmarks: because there were missing outputs, the total counts differed and the rank comparisons were a bit messed up.
Currently, we iterate over the PhEval TSV processed outputs --> find the corresponding phenopacket in the test data directory --> benchmark that way.

This change instead iterates over the phenopackets in the test data directory --> finds the corresponding PhEval TSV output (missing outputs are handled) --> benchmarks that way.

This allows us to account for missing results, i.e., if a tool fails on certain cases and no output is written, those cases are still counted in the new way of benchmarking.
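A minimal sketch of the new iteration order (the `collect_results` helper and the `*-pheval_gene_result.tsv` naming pattern are illustrative assumptions, not the actual PhEval code):

```python
from pathlib import Path


def collect_results(phenopacket_dir: Path, results_dir: Path) -> dict:
    """Iterate over phenopackets first, then look up the matching PhEval TSV.

    Phenopackets with no corresponding result map to None, so failed cases
    still contribute to the benchmark totals instead of silently dropping out.
    (Hypothetical helper; filename pattern is an assumption.)
    """
    results = {}
    for phenopacket in sorted(phenopacket_dir.glob("*.json")):
        tsv = results_dir / f"{phenopacket.stem}-pheval_gene_result.tsv"
        results[phenopacket.stem] = tsv if tsv.exists() else None
    return results
```

The key point is that the denominator comes from the test data directory, not from whichever outputs the tool happened to produce.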