google-deepmind / alphageometry

Apache License 2.0

question on synthetic data generation to training data #92


shufan1 commented 8 months ago

After a synthetic proof is generated, how is it used as training data? For example, if a synthetic proof has N auxiliary constructions between the premise statement s_1 and the conclusion statement s_{N+1}, would you make multiple training data entries from this one proof by taking every intermediate statement, adding one auxiliary point construction at a time? That is, would you generate N-1 data entries from this single N-construction proof:

If each single data entry follows this format:

I believe the paper says 100M proofs were generated, of which 9M have at least one auxiliary construction. It later also says the fine-tuning used 9M examples. How are the proofs with multiple constructions handled? I assume the transformer model predicts only one auxiliary construction at a time. I may have misunderstood this part; please let me know if I can clarify my questions. Thanks so much for your help.
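To make my question concrete, here is a minimal sketch of the splitting scheme I have in mind. This is purely my assumption about the data pipeline, not code from the repo: `split_proof`, and the way context/target pairs are formed, are hypothetical. Each entry's context would hold the premises, the constructions made so far, and the conclusion, and the target would be the next auxiliary construction:

```python
# Hypothetical sketch (not from the alphageometry repo): split a proof
# with N auxiliary constructions into per-construction training entries.

def split_proof(premises, constructions, conclusion):
    """Yield one (context, target) pair per auxiliary construction.

    The context is the premises plus all constructions already made,
    plus the conclusion; the target is the next construction to predict.
    """
    entries = []
    done = []  # constructions added so far
    for aux in constructions:
        context = tuple(premises + done + [conclusion])
        entries.append((context, aux))
        done.append(aux)
    return entries

# A proof with 3 auxiliary constructions between s_1 and s_{N+1}:
entries = split_proof(
    premises=["s_1"],
    constructions=["aux_1", "aux_2", "aux_3"],
    conclusion="s_{N+1}",
)
# 3 constructions -> 3 entries; entry i's context contains aux_1..aux_{i-1}
```

Is this roughly what happens, or is each multi-construction proof serialized as a single training example?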

ParthaEth commented 8 months ago

How do you generate the random proofs in the first place? I mean concretely, not conceptually. Is there a code snippet?