[ ] A pre-existing file and dir are only used much later, which is confusing when you ls. Could they be downloaded automatically when you get to the step that needs them?
AFAIK, there is no such Katacoda feature. We could move process.py to another repository and download it, but that's overkill IMHO.
[x] "in Git, DVC remote storage config saved in Git" -> "in Git, and DVC remote storage config"
[x] "needed to access and download" -> "needed to access" - but this whole sentence is too long, could be rewritten.
[ ] ls -lh -> just ls?
The -lh option is there to show that these are identical files.
[x] "dvc get automated this by reading" - This explanation would make more sense before the wget example
[ ] .dvc/config and get-started/data.xml.dvc links - Should it open the in-system IDE instead?
These two are remote files. It's better to open them in the browser.
[x] "at the dataset-registry you cannot find it" -> "at the dataset-registry, you cannot find the file"
[x] "stored in a data storage" -> "stored in a DVC remote"
Step 3:
Now this is Step 2
[x] "if you look at the Get Started repository" -> Should be [Data Registry]
[x] "dvc get can download them, but how do we first even know what exactly there before downloading (or accessing in other ways we'll cover later)?" -> "We can dvc get them, but how do we even know what data is tracked in a remote DVC repo before accessing it?"
[x] "we pass Git URL" -> "we pass a Git URL"
[x] "as with dvc get" -> "as dvc get"
[x] "Now, you can see the data.xml file. As well" -> "Now we can see data.xml and"
Step 4:
[x] "Alternatively to the command line dvc get" -> "Besides using dvc commands"
[x] "with dvc.api" -> "with the Python API" (same link)
[x] Install dvc first, I think
In step 1, we installed and initialized. This scenario is actually a relic. :))
Now it installs from pip anyway. I timed apt, snap and pip, and pip seems to be the fastest.
[x] cat process.py... - Use IDE instead?
[x] "Yes, the interface" -> "The interface"
[x] "works similar" -> "works the same way"
[x] "It doesn't consume space for a file on the file system - it reads data directly into memory" -> "open() doesn't consume space in the file system - it streams data into memory as needed"
[x] "Means, you can" -> "This means that you can"
[ ] But this 3rd point is kind of repetitive vs the 2nd one, may want to rephrase a bit
I changed the second one's verb to load. Point 2 is having no disk space requirement, point 3 is low memory footprint.
[x] "interface is the same" -> "the interface is the same"
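
To make the distinction between points 2 and 3 above concrete, here is a minimal sketch of how the scenario text could illustrate it. It assumes the `dvc` package is installed and that the Data Registry lives at `https://github.com/iterative/dataset-registry` (the URL is an assumption, this thread only names the repo); `count_lines` is a hypothetical helper, not part of process.py.

```python
def count_lines(stream):
    """Count lines by iterating, so only one line is held in memory at a time."""
    return sum(1 for _ in stream)


def main():
    # Imported here so the helper above stays usable without dvc installed.
    import dvc.api

    # Assumed repo URL and path; adjust to whatever the scenario actually uses.
    repo = "https://github.com/iterative/dataset-registry"
    path = "get-started/data.xml"

    # Point 2: read() loads the file into memory, with no copy in the workspace.
    text = dvc.api.read(path, repo=repo)
    print(len(text))

    # Point 3: open() streams the file, so the memory footprint stays low.
    with dvc.api.open(path, repo=repo) as f:
        print(count_lines(f))


if __name__ == "__main__":
    main()
```

Neither call writes data.xml to the workspace; the difference is only in how much of the file is in memory at once.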
Step 5:
[ ] I'm not sure we even need the pre-existing example-get-started dir. Why have that and overwrite data/data.xml? Just to match https://dvc.org/doc/start/data-access#download? The rest of the scenario doesn't match the GS anyway.
You're right. I don't know the reason but the initial reference was GS docs, I think. I butchered the repo in later scenarios but this one had some content and I checked only the commands actually. The text is not mine.
[x] "simplified" -> "simplifies"
[x] "complexity" -> "the complexity"
[x] "How about ..." -> "What about datasets or ML models?"
[x] "DVC repositories and dvc import command" -> "A DVC repository and the dvc import command"
[x] "The url and rev_lock subfields" - Needs more context (mention data.xml.dvc)
[x] git diff -> could just be cat data/data.xml.dvc. It's not clear why we're comparing something.
[x] "dvc import, is" -> "dvc import is"
[x] "repository this" -> "repository, this"
Step 6:
[x] Not sure we need it since we've already mentioned and linked to the Data Registry pattern (use case).
This one fixes #28
https://katacoda.com/dvc/courses/get-started/accessing