Closed delfireinoso closed 7 months ago
All imported datasets are showing a size of 0: Open-Orca/OpenOrca and tatsu-lab/alpaca from the Dataset Store, and databricks/databricks-dolly-15k, imported through the Hugging Face dialog. In .cache there are these folders: Downloads with 4 GB, tatsu-labalpaca with 46 MB, Open-Orcaopen_orca with 7.22 GB, and databricks___databricks-dolly-15k with 12.3 MB.
So I understand the data has downloaded to your computer, but I'm wondering if the app is not reading the dataset for some reason. Can you click on the Info button and confirm that there's valid data? When you click on Preview for the dataset, do you see sample data from the dataset?
I can reproduce the issue: if you enter a Hugging Face repo ID to download, it shows as 0 bytes even though it looks like the repo has downloaded correctly. Will update once I dig in more.
databricks/databricks-dolly-15k
instruction | context | response | category |
When did Virgin Australia start operating? | Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia’s domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney. | Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. | closed_qa |
Which is a species of fish? Tope or Rope | | Tope | classification |
Why can camels survive for long without water? | | Camels use the fat in their humps to keep them filled with energy and hydration for long periods of time. | open_qa |
Open-Orca/OpenOrca
id | system_prompt | question | response |
niv.242684 | | You will be given a definition of a task first, then some input of the task. This task is about using the specified sentence and converting the sentence to Resource Description Framework (RDF) triplets of the form (subject, predicate object). The RDF triplets generated must be such that the triplets accurately capture the structure and semantics of the input sentence. The input is a sentence and the output is a list of triplets of the form [subject, predicate, object] that capture the relationships present in the sentence. When a sentence has more than 1 RDF triplet possible, the output must contain all of them. AFC Ajax (amateurs)'s ground is Sportpark De Toekomst where Ajax Youth Academy also play. Output: | [ [“AFC Ajax (amateurs)”, “has ground”, “Sportpark De Toekomst”], [“Ajax Youth Academy”, “plays at”, “Sportpark De Toekomst”] ] |
flan.564327 | You are an AI assistant. You will be given a task. You must generate a detailed and long answer. | Generate an approximately fifteen-word sentence that describes all this data: Midsummer House eatType restaurant; Midsummer House food Chinese; Midsummer House priceRange moderate; Midsummer House customer rating 3 out of 5; Midsummer House near All Bar One | Midsummer House is a moderately priced Chinese restaurant with a 3/5 customer rating, located near All Bar One. |
tatsu-lab/alpaca
instruction | input | output | text |
Give three tips for staying healthy. | | 1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. 2. Exercise regularly to keep your body active and strong. 3. Get enough sleep and maintain a consistent sleep schedule. | Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: Give three tips for staying healthy. ### Response: 1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. 2. Exercise regularly to keep your body active and strong. 3. Get enough sleep and maintain a consistent sleep schedule. |
What are the three primary colors? | | The three primary colors are red, blue, and yellow. | Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: What are the three primary colors? ### Response: The three primary colors are red, blue, and yellow. |
These are the first rows of the preview, which showed only some rows, not the complete dataset. The first one is from Hugging Face, the other two from the Dataset Store.
OK, I've reproduced that when you download via the download Hugging Face button it is not saving the size correctly in the DB. The model should still work. But when I downloaded from the Dataset Store it showed up with the correct size for me after the download was complete. You're saying you downloaded Orca and Alpaca from the store and they show a size of 0?
Yes, the two models from the Dataset Store show 0 size too.
It's not always this way.
I had a more accurate result in an earlier test. This time I ran the test after wiping the data completely.
Perhaps they go wrong if you download them after downloading the Hugging Face dataset.
I've re-downloaded databricks twice. Once it worked and once it did not. It seems the Hugging Face code we call to get the size sometimes doesn't have the data yet. I will see if I can find a pattern and either compensate or post something on the HF site.
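To illustrate the missing-size case, here is a minimal sketch, assuming the size is read from a `DatasetInfo`-like object such as the one the `datasets` library exposes through `load_dataset_builder(...).info`. The helper name `reported_size` is hypothetical and this is not the app's actual code; it only shows treating a missing or zero value as "unknown" rather than saving 0.

```python
from types import SimpleNamespace
from typing import Any, Optional


def reported_size(info: Any) -> Optional[int]:
    """Return the first positive size found on a DatasetInfo-like object, else None."""
    # These attribute names exist on datasets.DatasetInfo, but any of them may be None.
    for attr in ("dataset_size", "size_in_bytes", "download_size"):
        value = getattr(info, attr, None)
        if isinstance(value, (int, float)) and value > 0:
            return int(value)
    return None


# Stand-ins for metadata with and without size information (values are made up).
complete = SimpleNamespace(dataset_size=1234, download_size=5678)
missing = SimpleNamespace(dataset_size=None, download_size=None)
```

With real metadata, `info` would be something like `load_dataset_builder("databricks/databricks-dolly-15k").info`; a `None` result signals that the size still has to be determined some other way.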
I am trying to pull the number from Hugging Face, but I guess sometimes Hugging Face doesn't supply it: https://docs.databricks.com/en/_extras/notebooks/source/deep-learning/hugging-face-dataset-download.html
I don't know why it works sometimes and not others, so will have to build in a backup or alternative way to check!
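One possible backup, sketched here under assumptions rather than taken from the app: if the reported size is missing or 0, measure the dataset's cache folder on disk instead. The function names `directory_size_bytes` and `effective_size` are hypothetical, and the temporary directory stands in for a real folder under .cache.

```python
import os
import tempfile
from typing import Optional


def directory_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files under path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def effective_size(reported: Optional[int], cache_dir: str) -> int:
    """Prefer the reported size; fall back to the on-disk size when it is None or 0."""
    if reported:
        return reported
    return directory_size_bytes(cache_dir)


# Demo on a throwaway directory standing in for a dataset cache folder.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "part.bin"), "wb") as f:
        f.write(b"x" * 2048)
    demo_fallback = effective_size(None, tmp)  # metadata missing: measured from disk
    demo_reported = effective_size(4096, tmp)  # metadata present: used as-is
```

The on-disk number may differ from what the hub would report (compressed downloads versus prepared files), so it is best read as an approximation that avoids showing 0.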
Added a check after the model is downloaded. Closing this, but will keep an eye open as we are going to fix a few things related to datasets, and also add more datasets.
After downloading a public dataset, databricks/databricks-dolly-15k, the Imported Datasets pane shows Size=0, but on disk the data is complete.