.gitignore the output files from intermediary steps

scribe-org / Scribe-Data

Wikidata, Wiktionary and Wikipedia language data extraction

GNU General Public License v3.0


.gitignore the output files from intermediary steps #97

Closed wkyoshida closed 6 months ago

wkyoshida commented 6 months ago

Terms

[X] I have searched open and closed issues
[X] I agree to follow Scribe-Data's Code of Conduct

Issue

In looking at some recent PRs, I'm wondering if it would make sense to add some of the output files from intermediary steps of the data process to the .gitignore, so that they don't get committed by mistake. An example of such a file would be nouns_queried.json, which is an intermediary step toward the formatted nouns.json file.

One idea could be to:

- For all files that we do not wish to add to version control, name them with a specific format, e.g. scribe_nouns_queried.json or mid_step_nouns_queried.json
- Then add to the .gitignore an entry with a wildcard path for all files with that format -> **/scribe_*.json or **/mid_step_*.json for the equivalent formats above

Would this even make sense @andrewtavis? Wondered about this especially since folks are contributing more to data processes.

andrewtavis commented 6 months ago

I think this is something to consider, @wkyoshida :) I'm a bit confused as to why these files are still being generated as they're destroyed at the end of the formatting steps. We could just do nouns_queried.json and the ones for the other word types though? Not sure why we'd need the intermediary or Scribe names as they're already distinctly named :)

shashank-iitbhu commented 6 months ago

> I think this is something to consider, @wkyoshida :) I'm a bit confused as to why these files are still being generated as they're destroyed at the end of the formatting steps. We could just do nouns_queried.json and the ones for the other word types though? Not sure why we'd need the intermediary or Scribe names as they're already distinctly named :)

The {data_type}_queried.json files are correctly being deleted at the end of the formatting process. In PR #93, it appears that this file was accidentally committed after running the query process explicitly. Just adding {data_type}_queried.json to .gitignore for the data types nouns, verbs and prepositions would guard against such accidental commits.
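To illustrate the lifecycle described above, here is a minimal sketch of a formatting step that removes its intermediary file once the formatted output is written. This is not Scribe-Data's actual code: `format_nouns`, the `"noun"` key and the sample data are all hypothetical and only stand in for the real process. It also shows why running the query step on its own, without the formatting step, would leave a `nouns_queried.json` behind.

```python
import json
import tempfile
from pathlib import Path


def format_nouns(queried_path, formatted_path):
    """Hypothetical formatting step (an assumption, not the project's code)."""
    raw = json.loads(queried_path.read_text(encoding="utf-8"))
    # Hypothetical formatting: index the queried entries by their "noun" key.
    formatted = {entry["noun"]: entry for entry in raw}
    formatted_path.write_text(json.dumps(formatted), encoding="utf-8")
    # The intermediary file is deleted only once formatting has finished.
    queried_path.unlink()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        queried = Path(tmp) / "nouns_queried.json"
        formatted = Path(tmp) / "nouns.json"
        # Stand-in for the query step: write the intermediary file.
        queried.write_text(json.dumps([{"noun": "Haus"}]), encoding="utf-8")
        # If the process stopped here, nouns_queried.json would remain on disk.
        format_nouns(queried, formatted)
        return queried.exists(), formatted.exists()
```

Calling `demo()` reports whether the intermediary file and the formatted file exist after the formatting step has run.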

wkyoshida commented 6 months ago

> .. Not sure why we'd need the intermediary or Scribe names as they're already distinctly named :)

Gotchu! Yeah - that idea came about mostly in case there were any other files that I missed or wasn't remembering :laughing: If they happened to have differing formats, a single naming format would allow a single .gitignore entry to cover all of them, but perhaps I was over-complicating things :sweat_smile:

If the only files in question are the {data_type}_queried.json files though, then, as @shashank-iitbhu suggested, a simple **/*_queried.json already covers this for us :rocket:
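As a rough way to see which files such an entry is meant to cover, the sketch below lists every `*_queried.json` file at any depth under a directory. It uses `Path.rglob`, which approximates the "any directory level" behavior of the `**/` prefix for this simple file-name pattern; it does not reimplement gitignore matching. The function name `find_queried_files` and the directory layout in `demo` are assumptions made up for illustration.

```python
import tempfile
from pathlib import Path


def find_queried_files(root):
    """Return sorted relative paths of *_queried.json files under root."""
    root = Path(root)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*_queried.json")
        if path.is_file()
    )


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical layout, not the real Scribe-Data directory tree.
        nouns_dir = root / "German" / "nouns"
        nouns_dir.mkdir(parents=True)
        (nouns_dir / "nouns_queried.json").write_text("[]", encoding="utf-8")
        (nouns_dir / "nouns.json").write_text("{}", encoding="utf-8")
        (root / "verbs_queried.json").write_text("[]", encoding="utf-8")
        return find_queried_files(root)
```

Here `demo()` picks up the two intermediary files and skips the formatted `nouns.json`. A check like this could be run before committing to spot leftovers, though the .gitignore entry makes that unnecessary in practice.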

This is simple. I can do it this week, but if anyone would like to jump in before me, feel free by all means :grin:

andrewtavis commented 6 months ago

98c899d closes this up :) Wanted to close out some things as there's LOTS going on right now 😊