starmpcc / Asclepius

Official Codes for "Publicly Shareable Clinical Large Language Model Built on Synthetic Clinical Notes"

How to address discrepancies between the synthetic notes and the PMC-Patients summaries? #6

Open JuneHou opened 6 months ago

JuneHou commented 6 months ago

Hi Junu Kim,

Are the two datasets matched row by row?

There are 157k synthetic discharge summaries in the dataset on Huggingface, but there are 167k patient summaries extracted from case reports in PubMed Central (PMC). I'm wondering whether the missing summaries were removed at random, cut from the bottom, or dropped due to low quality.

Best, Jun Hou

starmpcc commented 6 months ago

Hello, Jun.

We filtered out some low-quality note-question-answer pairs based on the criteria below:

  1. The synthetic note is extremely short.
  2. The note, question, or answer contains an alignment phrase such as "As an AI" or "Sorry, I cannot generate."

As a result, about 10k pairs were filtered out.
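For readers who want to apply a similar filter to their own data, the two criteria above can be sketched roughly as follows. This is not the authors' actual script: the field names (`note`, `question`, `answer`) and the length cutoff `MIN_NOTE_CHARS` are assumptions made for illustration, and only the two phrases quoted in this thread are included in the pattern list.

```python
import re

# Hypothetical cutoff: the thread does not say what counts as "extremely short".
MIN_NOTE_CHARS = 200

# Only the two phrases quoted in the reply; the authors' full list is not given.
ALIGNMENT_PATTERNS = [
    re.compile(r"\bas an ai\b", re.IGNORECASE),
    re.compile(r"sorry, i cannot generate", re.IGNORECASE),
]


def has_alignment_phrase(text):
    """Return True if the text contains any of the known alignment phrases."""
    return any(pattern.search(text) for pattern in ALIGNMENT_PATTERNS)


def keep_example(example, min_note_chars=MIN_NOTE_CHARS):
    """Apply both criteria to one record with assumed keys note/question/answer."""
    # Criterion 1: drop records whose note is too short.
    if len(example["note"].strip()) < min_note_chars:
        return False
    # Criterion 2: drop records where any field contains an alignment phrase.
    fields = ("note", "question", "answer")
    return not any(has_alignment_phrase(example[f]) for f in fields)


def filter_examples(examples, min_note_chars=MIN_NOTE_CHARS):
    """Return only the records that pass both criteria, preserving order."""
    return [ex for ex in examples if keep_example(ex, min_note_chars)]
```

Note that any filter of this kind removes rows from the middle of the data, so row positions in the filtered set would no longer line up with an unfiltered source by index alone.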

Thank you!