kbressem / medAlpaca

LLM finetuned for medical question answering
GNU General Public License v3.0

The links to the StackExchange datasets are no longer working #22

Closed: s1ghhh closed this issue 1 year ago

s1ghhh commented 1 year ago

Thank you for open-sourcing such a fantastic project. Since the links in the README are no longer working, I would like to know where I can access the StackExchange dataset series.

kswanjitsu commented 1 year ago

I just came here to ask the same thing. I'm guessing it was TOU / TOS?

kbressem commented 1 year ago

Sorry for the inconvenience. I took it down because (1) I wanted to further clean the dataset and (2) I did not provide the source for all answers, which would violate the license. However, Hugging Face now hosts a much better StackExchange dataset at https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences.

s1ghhh commented 1 year ago

Thank you for your response; I understand your situation. I have reviewed the link you shared. The entries in this dataset are often lengthy, and one question typically has multiple answers. May I ask whether you paired each question with a single answer? Additionally, the data contains many HTML tags and links. Will you remove them? Once again, thank you for your response and for sharing.

sarapieri commented 1 year ago

Hi @kbressem, I read the previous messages. In this case, will the dataset be made available in the near future?

kbressem commented 1 year ago

No. The dataset on Hugging Face is really good and I see no benefit in uploading another crawl of the same data. Please give the Hugging Face dataset a try.
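
For readers who try the Hugging Face dataset, the two preprocessing questions raised earlier in this thread (one answer per question, and removal of HTML tags and links) could be handled roughly as sketched below. This is only an illustration, not the preprocessing used for medAlpaca. It assumes each record is a dict with a `question` string and an `answers` list whose items carry `text`, `pm_score`, and `selected` fields; those field names are an assumption and should be checked against the dataset card. The helper names `strip_html`, `best_answer`, and `to_qa_pair` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment, dropping all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment: str) -> str:
    """Remove tags (including link targets) and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join("".join(parser.parts).split())


def best_answer(record: dict) -> Optional[dict]:
    """Pick one answer: prefer the accepted one, then the highest score.

    The "selected" and "pm_score" keys are assumed field names.
    """
    answers = record.get("answers") or []
    if not answers:
        return None
    return max(
        answers,
        key=lambda a: (bool(a.get("selected")), a.get("pm_score", 0)),
    )


def to_qa_pair(record: dict) -> Optional[dict]:
    """Turn one multi-answer record into a single cleaned question/answer pair."""
    answer = best_answer(record)
    if answer is None:
        return None
    return {
        "question": strip_html(record["question"]),
        "answer": strip_html(answer["text"]),
    }
```

Preferring the accepted answer over the highest-scored one is a design choice made here for illustration; ranking by score alone, or keeping several answers per question, may suit other uses better. Note that `strip_html` keeps the visible link text and discards the URL.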