Restructuring the Word Lists

abuango commented 2 years ago

This PR resolves https://github.com/inclusivenaming/website/issues/131 & https://github.com/inclusivenaming/website/issues/128.

The current wordlist pages list every word for a tier on a single page, which is difficult to scale. In the last Language workstream meeting, it was suggested that it would be more efficient to have individual files per word and a listing of the words per tier.

This PR solves the following:

Provides a template for adding new words
Restructures the file listing into wordlist > tier > word files (HTML & JSON), where the wordlist and tier folders contain index files for listing the words in each tier
Generates both HTML and JSON files for each word and for the tiers. The JSON files can be consumed as an API endpoint.

Reference:

Build a JSON API With Hugo's Custom Output Formats

markcmiller86 commented 2 years ago

@abuango Great Work 💪🏻

This looks like a wonderful start 🎉

I went looking for a preview URL and found only this one which either I don't understand or is not fully working. Do you have a full preview URL anywhere handy?

abuango commented 2 years ago

> @abuango Great Work 💪🏻
>
> This looks like a wonderful start 🎉
>
> I went looking for a preview URL and found only this one which either I don't understand or is not fully working. Do you have a full preview URL anywhere handy?

Oh, I didn't know it generated a preview. It's not added to the navigation yet; I tried to figure out the structure first before polishing it. Here is the preview link to the skeletal test word page, and here is what the JSON file will look like.

The only difference between the two is the file extension:

wordlist/tier-1/abort/index.json

/wordlist/tier-1/abort/index.html

Hugo will generate both files from the markdown file of the word. I will clean it up tomorrow so we can have a preview.

markcmiller86 commented 2 years ago

@abuango thanks for the links to the word page and json file.

Can you confirm...do authors of new word recommendations continue to compose their recommendations in markdown (maybe as a separate file here in the proper wordlist/tier-X directory), and then that information gets used to generate the json file (for API endpoints) and word-list file? If so, that sounds cool.

I might recommend that the json file contain a sparing amount of information such as just...

the word (or phrase) to be replaced
- Do we need to think a bit more about how to best store this information in json context to facilitate downstream matching and replacement automation? Some issues I can think of are case sensitivity, punctuation (if the phrase has any), singular/plural, derived words (e.g. segregate vs. segregation)
its tier (1, 2 or 3)
the INI recommended replacements (perhaps as an ordered array)
URL to the INI recommendation page where the word (or phrase) is reviewed.
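
A minimal sketch of what such a record might look like, in Python. The field names (`term`, `tier`, `replacements`, `url`), the example replacements, and the URL are assumptions for illustration, not an agreed INI schema. The matcher shows one naive way to handle case and a simple plural; derived words such as "segregation" would still need their own entries or a stemmer.

```python
import json
import re
from dataclasses import dataclass, asdict


@dataclass
class TermRecord:
    term: str            # word or phrase to be replaced
    tier: int            # 1, 2 or 3
    replacements: list   # ordered, most preferred first
    url: str             # page where the term is reviewed (hypothetical URL)


def build_pattern(term: str) -> re.Pattern:
    """Case-insensitive whole-word match that also accepts a trailing 's'."""
    return re.compile(r"\b" + re.escape(term) + r"s?\b", re.IGNORECASE)


record = TermRecord(
    term="abort",
    tier=1,
    replacements=["cancel", "stop"],  # illustrative values only
    url="https://example.org/wordlist/tier-1/abort/",
)

# Serialise the record the way a downstream tool might receive it.
print(json.dumps(asdict(record), indent=2))

# Scan a sample sentence for the term, ignoring case and a plural 's'.
pattern = build_pattern(record.term)
matches = [m.group(0) for m in pattern.finditer("Abort the job; earlier aborts were logged.")]
print(matches)
```

Storing the canonical term in one form and deriving the match pattern at load time keeps the JSON small, at the cost of pushing matching rules into each client.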

abuango commented 2 years ago

> Can you confirm...do authors of new word recommendations continue to compose their recommendations in markdown (maybe as a separate file here in the proper wordlist/tier-X directory), and then that information gets used to generate the json file (for API endpoints) and word-list file? If so, that sounds cool.

Yes, new word recommendations are created using markdown in the respective tier folder.

And thank you for the recommended fields for JSON, it will be cleaner that way.

markcmiller86 commented 2 years ago

It looks like, quite by coincidence, another developer, @jamesgeddes, has proposed an example of how the json file should be structured in this issue, which is on the INI org repo.

jamesgeddes commented 2 years ago

Thanks @markcmiller86 !

As I suggested in https://github.com/inclusivenaming/org/issues/108, the main version of the INI suggested language list should live in its own repo. This then allows any client, including the INI website, to use it as a single source of truth.

Muddling it in with the INI website could make things unclear.

jamesgeddes commented 2 years ago

Additionally, I would suggest that the process(es) for adding new terms should be kept separate to the wordlist itself, so it would be best practice to separate these two features into two PRs.

Another benefit of having it in its own repo is that updates can be done both via PR and via a GUI.

jamesgeddes commented 2 years ago

Regarding the efficiency of "individual files per word", I would suggest that simply allowing clients to download one file and run a for loop over it is probably a simpler solution than building out an INI API, which would require

many calls per client
additional hosting costs for the INI
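
To make the comparison concrete, here is a rough Python sketch of the single-file approach: the client loads one JSON document and loops over it locally. The file layout (a top-level list of objects with `term` and `tier` keys) and the second entry are assumptions, and the data is inlined as a string so the sketch runs without any network access.

```python
import json

# Stand-in for the contents of one downloaded file (hypothetical layout).
WORDLIST_JSON = """
[
  {"term": "abort", "tier": 1},
  {"term": "example-term", "tier": 2}
]
"""


def find_flagged_terms(text: str, wordlist: list) -> list:
    """Return (term, tier) pairs for every listed term that appears in text."""
    lowered = text.lower()
    return [(entry["term"], entry["tier"]) for entry in wordlist if entry["term"] in lowered]


wordlist = json.loads(WORDLIST_JSON)  # one parse, no per-word requests
flagged = find_flagged_terms("Press Ctrl-C to abort.", wordlist)
print(flagged)
```

The substring check is deliberately naive; the point is only that one download plus a local loop replaces a request per word.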

markcmiller86 commented 2 years ago

@jamesgeddes in comments 1 and 2 above you have proposed ideas for logistical questions that are far beyond my bailiwick. I can be sure to bring these questions to others' attention though.

On the specific issue of having the json file in its own repo...I see your point about it maybe being hard to find here. But, I think that is perhaps fixable using other approaches. That said, our intention is for it to become an auto-generated work-product, with the true source of INI language recommendations remaining the hosted web pages, which include the author-crafted (and researched) word/phrase recommendations.

jamesgeddes commented 2 years ago

@markcmiller86 The true source must be the one that everyone reads from, which would be the JSON main. The INI website would be a client of it, compiling the list based on the JSON. Separately, it would also be able to write to it. This would still allow the website to generate and update the JSON without humans needing to manually write JSON.

From a human perspective, it makes zero difference; it's purely a technicality.

markcmiller86 commented 2 years ago

> Regarding the efficiency of "individual files per word", I would suggest that simply allowing clients to download one file and run a for loop over it is probably a simpler solution than building out an INI API, which would require

Sorry...perhaps I mis-wrote. What I think we mean is that the machine readable version of published and released INI recommendations will take the form of the json file...which will be released on a still TBD periodic basis. Perhaps for each release, we host that file somewhere other than the website repo. Downstream tools just take up that file.

jamesgeddes commented 2 years ago

> Downstream tools just take up that file

I think we might be circling around agreement here 😂

jamesgeddes commented 2 years ago

Here is a very rough sketch of my suggestion.

[Image: ini-json]

Using this method, the INI would still be able to ensure that the SSoT is updated via GUI, but it would get the added benefits of

version control
separation
clarity
openness

abuango commented 2 years ago

I understand the approach @jamesgeddes is proposing and I agree with separating the wordlist from the website; it makes contributing to it easier and can form a single source of truth that the website also consumes. The proposed wordlist repo can be owned by the Language workstream and used mainly for the wordlist, separating it from the website repo. As @markcmiller86 mentioned, this will be shared with the rest of the folks in the workstream in the next meeting.

My only concern about maintaining a single file is scalability as the wordlist grows. Yes, it makes things a lot easier for clients, but maintaining a single file that could grow to thousands of lines in the near future needs to be looked into carefully. I would suggest at the very least organising the terms into separate files for each tier if individual word files are a deal-breaker. cc: @quaid
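
One way the two positions could be reconciled, sketched in Python under stated assumptions: maintainers edit one small file per tier, and a build step compiles them into the single list that clients download. The directory layout, the `tier-*.json` file names, and the record fields are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def compile_wordlist(source_dir: Path) -> list:
    """Merge every tier-*.json file in source_dir into one list sorted by tier, then term."""
    merged = []
    for tier_file in sorted(source_dir.glob("tier-*.json")):
        merged.extend(json.loads(tier_file.read_text(encoding="utf-8")))
    return sorted(merged, key=lambda entry: (entry["tier"], entry["term"]))


# Demonstration with throwaway files standing in for the per-tier sources.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp)
    (src / "tier-1.json").write_text(json.dumps([{"term": "abort", "tier": 1}]), encoding="utf-8")
    (src / "tier-2.json").write_text(json.dumps([{"term": "example-term", "tier": 2}]), encoding="utf-8")
    combined = compile_wordlist(src)

print(json.dumps(combined, indent=2))
```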

abuango commented 2 years ago

@jamesgeddes The next Language Workstream meeting is April 26, 2022 at 11:30am Pacific time. You are welcome to join the call when we discuss this.

jamesgeddes commented 2 years ago

@abuango For me, the deciding factor is how often we would be updating the list. If a new list version is likely to occur every day, then I wholeheartedly agree that separate files make sense. If it is every month, then new versions can be compiled into a staging branch before they make it into the main branch. Glad we agree about separating it into its own repo 🙂

I'll be at the meeting next week, thanks for the invite!

abuango commented 2 years ago

Today's meeting didn't take place, so I could not share the PoC of the WordLists page. A preview is available and currently contains some real data mixed with test data, so we can see how it works. A page has also been created for the Word List term template. The key concept behind this template is that it uses frontmatter only, with no content: Hugo generates the HTML and JSON file for each term using the data supplied in the frontmatter.
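
To illustrate the frontmatter-only idea outside Hugo, here is a small Python sketch that reads the frontmatter of a term page and emits it as JSON. It is not how Hugo does it internally; the parser handles only flat `key: value` pairs, and the field names in the sample page are assumptions rather than the actual template fields.

```python
import json


def read_front_matter(markdown: str) -> dict:
    """Parse flat `key: value` pairs between the opening and closing '---' lines.

    Deliberately minimal: scalars only, no lists or nesting. A real tool
    would use a full YAML parser.
    """
    lines = markdown.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    data = {}
    for line in lines[1:]:
        if line.strip() == "---":  # closing delimiter: stop reading
            break
        key, sep, value = line.partition(":")
        if sep:
            data[key.strip()] = value.strip().strip('"')
    return data


# Hypothetical term page: frontmatter only, no body content.
TERM_PAGE = """---
title: "abort"
tier: "1"
---
"""

fields = read_front_matter(TERM_PAGE)
print(json.dumps(fields, indent=2))
```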

quaid commented 2 years ago

Hey @abuango we've had some missed meetings and lost track of closing this discussion. It does seem this is doing what we discussed in that previous meeting, thank you so much. I'll look through and see if I have any questions, and if we need a meeting to decide or can do it async.

abuango commented 2 years ago

Feedback from Workstream call:

[x] Add more details to the Template page, so contributors know how to add new content
[x] Single JSON file for all terms
[x] Add "Replacement terms" to JSON file
[x] Move No-Change Tier down on the overview page and remove 0
[x] Move "All terms" section on the overview page to the right and make it more visible.

abuango commented 2 years ago

@LarryKunz @markcmiller86 I have implemented the feedback from our last meeting, kindly review.

inclusivenaming / website

Restructuring the Word Lists #134