lean-dojo / ReProver

Retrieval-Augmented Theorem Provers for Lean
https://leandojo.org
MIT License
218 stars 48 forks source link

Handling of in-file negatives in ReProver #46

Closed luan-xiaokun closed 7 months ago

luan-xiaokun commented 7 months ago

Hi, thanks for open releasing the source code of ReProver and providing a very helpful guide on training the model. I notice that the in-file negatives are critical for the performance of ReProver, as explained in the paper, and I'm trying to understand how this is implemented.

As far as I understand, given an example ex, this for-loop iterates over all premises that are present in the file where the context belongs, i.e., all premises in ex["context"].path.

https://github.com/lean-dojo/ReProver/blob/0dbb82e3507cb8303dbb550bdb96bfbbb37e1ced/retrieval/datamodule.py#L105-L112

Here is what I found a bit confusing: when ex["pos_premise"].path != ex["context"].path (i.e., when the premise is imported from some other file instead of defined/proved in the same file), all of these premises we are iterating on will be added to premises_outside_file, instead of premises_in_file. This seems a bit counter-intuitive, because these premises actually come from the same file as the context, could you please explain a bit more on this?

yangky11 commented 7 months ago

Hi,

Thanks for your interest in our work. "In-file negative" means the negative premise is from the same file as the positive premise. Therefore, if the positive premise is from the current file (i.e., ex["pos_premise"].path == ex["context"].path), all negative premises in the current file are in-file negatives.