pinder-org / pinder

PINDER: The Protein INteraction Dataset and Evaluation Resource
https://pinder-org.github.io/pinder/
Apache License 2.0

Select new random index on dimer filter failure #14

Closed ntoxeg closed 2 months ago

ntoxeg commented 2 months ago

This fixes a problem I’ve had with loading some data, in particular with the following setup:

from pinder.core import PinderLoader
from pinder.core.loader import filters

base_filters = [
    filters.FilterByMissingHolo(),
    filters.FilterMetadataFields(pinder_af2=("is not", True)),
]
sub_filters = [
    filters.FilterSubRmsds(rmsd_cutoff=7.5),
]

loader = PinderLoader(
    split="train",
    monomer_priority="holo",
    base_filters=base_filters,
    sub_filters=sub_filters,
)

The problematic item is at index 7739:

item = loader[7739]

What happens is that I get the same error message repeated 10 times:

2024-09-13 17:20:41,550 | pinder.core.loader.loader:426 | ERROR : Failed to apply sub_filter=FilterSubRmsds on 4ag4__A1_Q08345--4ag4__C1_UNDEFINED: Tried fallback, but number of CA atoms does not match
2024-09-13 17:20:41,550 | pinder.core.loader.loader:426 | ERROR : Failed to apply sub_filter=FilterSubRmsds on 4ag4__A1_Q08345--4ag4__C1_UNDEFINED: Tried fallback, but number of CA atoms does not match
2024-09-13 17:20:41,550 | pinder.core.loader.loader:426 | ERROR : Failed to apply sub_filter=FilterSubRmsds on 4ag4__A1_Q08345--4ag4__C1_UNDEFINED: Tried fallback, but number of CA atoms does not match
2024-09-13 17:20:41,550 | pinder.core.loader.loader:426 | ERROR : Failed to apply sub_filter=FilterSubRmsds on 4ag4__A1_Q08345--4ag4__C1_UNDEFINED: Tried fallback, but number of CA atoms does not match
2024-09-13 17:20:41,550 | pinder.core.loader.loader:426 | ERROR : Failed to apply sub_filter=FilterSubRmsds on 4ag4__A1_Q08345--4ag4__C1_UNDEFINED: Tried fallback, but number of CA atoms does not match
2024-09-13 17:20:41,550 | pinder.core.loader.loader:426 | ERROR : Failed to apply sub_filter=FilterSubRmsds on 4ag4__A1_Q08345--4ag4__C1_UNDEFINED: Tried fallback, but number of CA atoms does not match
2024-09-13 17:20:41,550 | pinder.core.loader.loader:426 | ERROR : Failed to apply sub_filter=FilterSubRmsds on 4ag4__A1_Q08345--4ag4__C1_UNDEFINED: Tried fallback, but number of CA atoms does not match
2024-09-13 17:20:41,550 | pinder.core.loader.loader:426 | ERROR : Failed to apply sub_filter=FilterSubRmsds on 4ag4__A1_Q08345--4ag4__C1_UNDEFINED: Tried fallback, but number of CA atoms does not match
2024-09-13 17:20:41,550 | pinder.core.loader.loader:426 | ERROR : Failed to apply sub_filter=FilterSubRmsds on 4ag4__A1_Q08345--4ag4__C1_UNDEFINED: Tried fallback, but number of CA atoms does not match
2024-09-13 17:20:41,550 | pinder.core.loader.loader:426 | ERROR : Failed to apply sub_filter=FilterSubRmsds on 4ag4__A1_Q08345--4ag4__C1_UNDEFINED: Tried fallback, but number of CA atoms does not match

This happens because no new random index is actually sampled, so the loader tries to load the same record 10 times. The maximum number of attempts is then exhausted, which results in an IndexError and a failure to load any more data. The cause is that when filtering fails in apply_dimer_filters, the iteration is skipped immediately afterwards, so the index to load stays the same. I have simply repeated the failure logic already used for apply_structure_filters, so that a new index is chosen for the next attempt.
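To illustrate the idea, here is a minimal, self-contained sketch of a retry loop that resamples the index after a filter failure. It is not the actual pinder loader code: `load_with_resampling`, `passes_filter`, the `records` list, and the `rng` parameter are hypothetical names made up for this example, and the default of 10 attempts is only assumed from the number of repeated log lines above.

```python
import random


def load_with_resampling(records, passes_filter, start_index, max_attempts=10, rng=None):
    """Return (index, record) for the first attempted record that passes the filter.

    Hypothetical sketch: after every filter failure a new random index is
    drawn, instead of retrying the same failing record on each attempt.
    """
    if rng is None:
        rng = random.Random()
    index = start_index
    for _ in range(max_attempts):
        record = records[index]
        if passes_filter(record):
            return index, record
        # Key step: draw a new random index before the next attempt.
        # (It may coincide with the old one by chance; without this line
        # the loop would be guaranteed to hit the same record every time.)
        index = rng.randrange(len(records))
    raise IndexError(f"No valid record found after {max_attempts} attempts")
```

Without the resampling line, a single record that always fails its filter exhausts every attempt and raises the IndexError, which matches the behaviour described above.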

danielkovtun commented 2 months ago

Thanks for reporting the bug and proposing an initial patch! Very much appreciated.

This issue has been addressed in another PR, along with a couple of minor fixes for the underlying cause of the filter failure in the first place. See my comments here: https://github.com/pinder-org/pinder/pull/18#issue-2543449320

I will go ahead and close this PR, but let us know if you're still running into any issues!