huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Reconstruction statements mentioned in the paper #30532

Closed liumc14 closed 1 month ago

liumc14 commented 2 months ago

Hello, the dataset used here is the SQuAD dataset, but the three domain-specific datasets created in the paper do not seem to be reflected in the code. Also, for the reconstruction statements in the three domain datasets released with the paper, the source and the target appear to be identical. Why is this? @shamanez https://github.com/huggingface/transformers/blob/73014b561d5f88d728e46a57d346f516fefe3f2d/examples/research_projects/rag-end2end-retriever/utils_rag.py#L62

amyeroberts commented 2 months ago

Hi, thanks for raising an issue!

This is a question best placed in our forums. We try to reserve the GitHub issues for feature requests and bug reports.

shamanez commented 2 months ago

@liumc14, you are correct. I open-sourced this code before my paper. Also, to keep the architecture clean, I didn't add the reconstruction-statement training.

But it is pretty straightforward:

  1. Mix the QA and reconstruction data while adding an identifier to each example.
  2. Then, during the forward computation, use only the retrieved documents as the input to the generator when the training example is tied to the reconstruction signal.
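
For concreteness, a minimal sketch of step 1 is below. It mirrors the one-example-per-line `.source`/`.target` layout used by `utils_rag.py`, but the `<qa>`/`<recon>` identifier tokens and the helper name are purely illustrative, not part of the released code.

```python
# Hypothetical sketch: write a mixed training split where every line carries an
# identifier token so the forward pass can tell QA examples from reconstruction ones.
# The <qa>/<recon> tokens and this helper are illustrative, not from the released code.
from pathlib import Path

QA_TOKEN = "<qa>"
RECON_TOKEN = "<recon>"

def write_mixed_split(qa_pairs, recon_statements, out_dir, split="train"):
    """qa_pairs: list of (question, answer); recon_statements: list of str."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / f"{split}.source", "w") as src, open(out_dir / f"{split}.target", "w") as tgt:
        for question, answer in qa_pairs:
            src.write(f"{QA_TOKEN} {question}\n")
            tgt.write(f"{answer}\n")
        # For reconstruction examples the source and target are the same statement;
        # the generator is trained to reproduce it from the retrieved documents.
        for statement in recon_statements:
            src.write(f"{RECON_TOKEN} {statement}\n")
            tgt.write(f"{statement}\n")
```
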
liumc14 commented 2 months ago

> @liumc14, you are correct. I open-sourced this code before my paper. Also, to keep the architecture clean, I didn't add the reconstruction-statement training.
>
> But it is pretty straightforward:
>
> 1. Mix the QA and reconstruction data while adding an identifier to each example.
> 2. Then, during the forward computation, use only the retrieved documents as the input to the generator when the training example is tied to the reconstruction signal.

@shamanez But in the three domain-specific dataset download links you provided in the paper (https://drive.google.com/drive/folders/1up3yKcJFArBQ6e0F_6n_mfW1VPHxA20A), I found after downloading the data that the reconstruction statements in the training set's .source file are identical to those in the .target file, for example:

    .source: American Civil Liberties Union, ACLU of Arizona, National Immigration Law Center slam law.
    .target: American Civil Liberties Union, ACLU of Arizona, National Immigration Law Center slam law.

In this case, can the reconstruction statements still be used for training?
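
(For reference, this is easy to check programmatically; a quick sketch assuming the one-example-per-line `.source`/`.target` layout from `utils_rag.py`, with illustrative file paths:)

```python
# Count how many training examples have identical source and target lines,
# i.e. the reconstruction statements. File paths are illustrative.
with open("train.source") as src, open("train.target") as tgt:
    pairs = list(zip(src, tgt))
identical = sum(s.strip() == t.strip() for s, t in pairs)
print(f"{identical}/{len(pairs)} examples have source == target")
```
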

shamanez commented 2 months ago

Yes, the statement should be reconstructed. But the input to the generator should be the retrieved docs related to the statement.

liumc14 commented 2 months ago

> Yes, the statement should be reconstructed. But the input to the generator should be the retrieved docs related to the statement.

@shamanez So training on the reconstruction statements actually means taking a reconstruction statement as input, retrieving the related documents, and letting the generator regenerate the statement from those documents? Thank you for your advice.
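
Concretely, something like the sketch below? (The retriever/generator interface here is just illustrative, not the actual code from this repo.)

```python
# Illustrative sketch of one reconstruction training step; the retriever/generator
# interface shown here is assumed, not taken from the released rag-end2end-retriever code.
def reconstruction_step(statements, retriever, generator, tokenizer):
    # 1) The statement is used only as the retrieval query.
    retrieved_docs = retriever(statements)            # list of passages per statement

    # 2) The generator is conditioned on the retrieved documents...
    inputs = tokenizer([" ".join(docs) for docs in retrieved_docs],
                       return_tensors="pt", padding=True, truncation=True)

    # 3) ...and trained to reproduce the original statement.
    labels = tokenizer(statements, return_tensors="pt",
                       padding=True, truncation=True).input_ids
    outputs = generator(**inputs, labels=labels)      # e.g. a seq2seq LM taking labels
    return outputs.loss
```
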

shamanez commented 2 months ago

Correct

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.