There are 157k synthetic discharge summaries in the dataset on Hugging Face, but 167k patient summaries were extracted from case reports in PubMed Central (PMC). Are the missing summaries randomly distributed, located at the bottom, or dropped due to low quality?
We filtered out low-quality note-question-answer pairs that met either of the criteria below:
- The synthetic note is extremely short.
- The note, question, or answer contains an alignment phrase such as "As an AI" or "Sorry, I cannot generate."
As a result, about 10k pairs were removed.
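
For readers who want to reproduce a similar filter on their own data, here is a minimal sketch of the two criteria above. It is not the authors' actual filtering code. The field names (`note`, `question`, `answer`), the 200-character threshold for "extremely short", and the case-insensitive matching are all assumptions made for illustration. Only the two example phrases come from this thread, and the real list may be longer.

```python
# Example alignment phrases quoted in the thread; the full list is not given.
ALIGNMENT_PHRASES = ("As an AI", "Sorry, I cannot generate")

# Hypothetical cutoff: the thread does not say what counts as "extremely short".
MIN_NOTE_CHARS = 200


def is_low_quality(note, question, answer, min_note_chars=MIN_NOTE_CHARS):
    """Return True if a note/question/answer triple matches either criterion."""
    # Criterion 1: the synthetic note is extremely short.
    if len(note.strip()) < min_note_chars:
        return True
    # Criterion 2: any field contains an alignment phrase
    # (case-insensitive matching is an assumption of this sketch).
    fields = (note, question, answer)
    return any(
        phrase.lower() in text.lower()
        for text in fields
        for phrase in ALIGNMENT_PHRASES
    )


def filter_pairs(rows, min_note_chars=MIN_NOTE_CHARS):
    """Keep only rows (dicts with hypothetical keys) that pass both checks."""
    return [
        row
        for row in rows
        if not is_low_quality(
            row["note"], row["question"], row["answer"], min_note_chars
        )
    ]
```

Because a filter like this drops rows based on content rather than position, the removed rows would not be expected to cluster at the bottom of the dataset.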
Hi Junu Kim,
Are the two datasets matched row by row?
Best, Jun Hou