Step 1:
"is like streamlined and smart wget" -> "is like a smart wget"
"DVC projects hosted on Git repositories" -> "DVC repositories"
"initialized DVC repository to use get" -> "initialized DVC project to use dvc get."
"ML model or a whole directory with DVC we use" -> "ML model, or a whole directory, with DVC we use" (commas)
"listed data.xml on .gitignore" -> "listed data.xml in .gitignore"
Step 2:
"keeps track of our data file:" -> "keeps track of our data."
"For a data file named data.xml, DVC keeps the tracking information in data.xml.dvc." -> we can remove this sentence I think, since it's previously explained (should be obvious). It's important to keep a good balance between explanation and commands in Katacoda I think (less words if possible)
"this is valid YAML 1.2 file" -> "this is a valid YAML 1.2 file"
ALSO, this should probably be a note (block quote) by itself?
"There is a field named md5 in the file." -> Should be part of the next paragraph
"DVC uses md5 field" -> "DVC uses the md5 hash" (lots of "the"/"a" words missing in general, let's try to watch that 🙂)
"address the file in its cache" -> "address the file in the cache"
Remove "The hash is a pointer to the content in DVC cache."
The whole explanation about the hash is a bit repetitive, can we boil it down to a single paragraph perhaps? Maybe 2 (one between cat data/data.xml.dvc and md5sum... and another one before tree .dvc/cache).
Try to use the same terms (e.g. "MD5") less often.
"using its MD5 hash as a file name" -> "using its hash as file path"
"default setting for DVC is to copy content both in the cache and the workspace" -> This is not correct. Reflinks are the default when available. Also "copy content from the cache and the workspace" (but again, that's not right).
In general I'm seeing that the updates, while improving explanations (great!), are also getting a little long and sometimes repetitive. I don't think Katacoda users will want to read that much (there's https://dvc.org/doc/start for that 🙂). Let's try to simplify now that we have all the right details.
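On the "using its hash as file path" wording in this step: a minimal sketch of the mapping, in case it helps when rewriting. It assumes the classic cache layout, where the first two hex characters of the hash become a directory under .dvc/cache and the rest becomes the file name. `cache_path` is a hypothetical helper for illustration, not a DVC API, and newer DVC versions may nest this under additional directories.

```python
import hashlib
from pathlib import PurePosixPath


def cache_path(content: bytes, cache_dir: str = ".dvc/cache") -> str:
    """Hypothetical helper: map file content to its path in the cache."""
    digest = hashlib.md5(content).hexdigest()
    # Assumed layout: <cache_dir>/<first 2 hex chars>/<remaining 30 hex chars>
    return str(PurePosixPath(cache_dir) / digest[:2] / digest[2:])


if __name__ == "__main__":
    # Print the assumed cache path for some sample content.
    print(cache_path(b"sample content\n"))
```

That is why "path" fits better than "name": the hash is split across a directory and a file name.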
Step 3:
"share data files with team members... Remotes that are accessible by other systems or team members" - a bit repetitive
"central location" - Remotes don't have to be centralized
"safely" - We don't provide any security features
"DVC allows to set up... Remotes ... can be set up" - Repetitive
"another supported storage type" - Link to docs (and remove the disconnected note at the end of the step)
"remotes are cache and content storage locations" -> "remotes are data storage locations" ?
"as if they are code/text files" -> "as if they were code/text files"
"It's possible to use another directory in the same disk as a remote also." -> "It's also possible to use a directory in the file system as a remote."
"This allows fast backups and we'll use" -> "This allows fast backups. We'll use"
"These configurations" -> "This configuration"
"Let's commit the configuration" -> "Let's commit the changes"
Step 4:
"In order to see the status of tracked data and model files and if they are stored in remotes" -> Actually status -c only looks at changes vs. the remote ("the status of tracked data" sounds like local changes). The previous version of this doc had this correct, let's revert to that order?
Also, "As dvc status --cloud shows" is repetitive.
"up to date with local cache" -> "up to date with the local cache"
"from local .dvc/cache to" -> "from the project's cache to"
"uses default remote" -> "uses the default remote"
"use --remote option for dvc push" -> "use the --remote option of dvc push"
"After pushing let's check" -> "Let's check"
"content of /tmp/data-storage" -> "content of /tmp/data-storage (location of our remote)"
"As you can see the structure" -> "The structure"
"is similar" -> "are identical"
"them contains" -> "them contain"
"same files addressed by the same hash values" -> "same file paths and contents"
"nor data/ contains" -> "nor data/ contain"
tree -a -I .git - What is this command for? Doesn't seem related. Also, it can't be clicked on.
"from remotes we can" -> "from remotes, we can"
"Let's see the status of workspace now:" Probably can remove this sentence (should be obvious what status does at this point).
"data.xml is missing from the workspace but" -> "data.xml is still missing from the workspace, but"
"DVC has a single command..." - Try to simplify this long paragraph, could be one sentence I think.
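On the "are identical" / "same file paths and contents" wording in this step: a rough sketch of how that claim could be checked before we state it so strongly. It assumes both locations are plain directories on the local file system (as in this scenario); `tree_digest` and `same_tree` are hypothetical names, not part of DVC.

```python
import hashlib
import os


def tree_digest(root: str) -> dict:
    """Map each file path (relative to root) to the MD5 of its content."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            with open(full, "rb") as fh:
                digest = hashlib.md5(fh.read()).hexdigest()
            result[os.path.relpath(full, root)] = digest
    return result


def same_tree(a: str, b: str) -> bool:
    """True when both directories hold the same relative paths and contents."""
    return tree_digest(a) == tree_digest(b)
```

Calling `same_tree(".dvc/cache", "/tmp/data-storage")` after the push would tell us whether "identical" holds in the Katacoda environment, or whether "similar" is the safer word.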
Step 5:
"not what would you usually do" -> "not what we'd usually do"
"there are some RAM limitations" -> "there are limitations"
"to being able" -> "to be able"
"and update data/data.xml.dvc to match..." -> "and update data/data.xml.dvc."
"in Git history" -> "in the Git history"
"through the MD5 hash" -> "through the hash" (we don't always use MD5, see import/get)
"of the cached file that is also its name on the cache" -> "of the cached file, that is also its path in the cache" but we've already explained this a few times...
Step 6:
"Although DVC can work without a VCS" -> "Git serves as the VCS for text and code files. Although DVC can work without it"
And remove "Git serves as the VCS..." from the next paragraph
"history of .dvc" -> "the history of .dvc"
"whose paths in the workspace are missing" -> "whose paths are not referenced from the workspace"
I don't think we need the next paragraph much since this process has been explained quite a bit already by this point (in previous steps).
"of the dataset data/data.xml" -> "of data/data.xml"
HEAD^1 could be HEAD~ (more common I think)
"We can see that current hash and the hash value in data.xml.dvc is different." -> Should be part of the previous paragraph, and: "And it differs from the current hash of data.xml.dvc:"
"in data.xml.dvc we use" -> "in data.xml.dvc, we use:" (comma, colon)
"and this copies" -> "This links"
"from from local cache" -> "from the cache"
"dvc checkout command" -> Remove extra line break and just say "dvc checkout", but...
"synchronizes data files in the workspace to match" - Repetitive, we've already explained this is what we're going to do.
"we can see that X and Y." - In general avoid this phrase, it can be considered condescending. Best to just make the statement plainly e.g. below:
"...the value in data.xml.dvc and the hash value of data.xml are identical." -> "The md5 value in data.xml.dvc and the hash of data.xml should now match:"
"Instead of checkout we" -> "Instead of dvc checkout, we"
pull -> dvc pull
"dvc pull also downloads missing data into cache, while dvc checkout only can restore data that already in cache" -> "pulling also downloads missing data from remote storage, while dvc checkout can only restore data that's already in the local cache"
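To make the pull vs. checkout distinction concrete, here is a toy model. It is only an illustration: plain dicts stand in for the remote, the local cache and the workspace, and the function names are hypothetical, not DVC's implementation.

```python
def checkout(md5: str, cache: dict, workspace: dict, path: str) -> bool:
    """Toy model: restore a file into the workspace from the local cache only."""
    if md5 not in cache:
        return False  # nothing to restore from
    workspace[path] = cache[md5]
    return True


def pull(md5: str, remote: dict, cache: dict, workspace: dict, path: str) -> bool:
    """Toy model: download into the cache if missing, then check out."""
    if md5 not in cache and md5 in remote:
        cache[md5] = remote[md5]
    return checkout(md5, cache, workspace, path)
```

With an empty cache, `checkout` has nothing to restore while `pull` succeeds as long as the remote has the data, which is the point the sentence should get across.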
https://katacoda.com/dvc/courses/get-started/versioning