Create lists of commands to test coverage parity against

leostera commented 8 years ago

We should absolutely leverage the online linux man pages to periodically fetch a big, big list of commands.

Sample: http://linux.die.net/man/1/ has almost 10,000 commands.

waldyrious commented 8 years ago

We could make separate projects to track commands based on platform (since they overlap, we can't use milestones, which is a pity since it would give us a nice progress bar)

Linux:

(now moved to the spreadsheet mentioned below)

Windows:

OS X:

be5invis commented 7 years ago

@waldyrious For Windows, commands in CMD and PowerShell are DIFFERENT. For example, dir is a CMD built-in, also an alias of Get-ChildItem in PowerShell. (Even ls in PowerShell is an alias, though they are going to remove it.)

waldyrious commented 7 years ago

@be5invis thanks for bringing that up. It is certainly something we need to consider (e.g. we currently treat all linuxes the same, even though some of the commands are shell-specific). See #190 and #816 for previous discussion.

That said, that problem does not affect this issue: the former deals with how we organize the command pages we do have, while this issue is about identifying which commands we don't yet have, but should.

be5invis commented 7 years ago

@waldyrious The full PowerShell commands on my PC: https://gist.github.com/be5invis/57d906e6f6935f7a1f19279878c2c214

sbrl commented 7 years ago

If the tldr client emits a different exit status depending on whether the page exists or not (like tldr-bash-client does, then we could have an semi-automatic bash script that runs through a list of commands and emits a list of commands that don't exist yet. I could even write something like that & create a gist quite easily.

It would certainly help people who want to contribute find a page that needs doing.

agnivade commented 7 years ago

You can always check the files present in the repo itself for parity no ?

sbrl commented 7 years ago

@agnivade Yeah, we could do that too! Do a git clone and then afind tldr -iname "command.md"` or something

sbrl commented 7 years ago

Executing

var els = document.querySelectorAll("dt a[href]");
var cmds = [];
for(let el of els) cmds.push(el.innerText);
cmds.join("\n");

on http://linux.die.net/man/1/, gives this file: linux-commands.txt

This is obviously pending sorting, which I'll do soon.

sbrl commented 7 years ago

Sorting complete! Here's what I came up with:

cat linux-commands.txt | xargs -P4 -I {} bash -c 'if [[ "$(find tldr/pages/ -name {}.md | wc -l)" -ne 0 ]]; then echo yep>>yeses.txt; else echo nope>>nos.txt; fi'
echo We have $(cat yeses.txt | wc -l) out of $(cat linux-commands.txt | wc -l) commands in tldr-pages -  $(cat nos.txt | wc -l) commands are missing.

Running the above reveals that:

We've got 328 commands documented
We've got 9497 to go
We've done ~3.34% so far

waldyrious commented 7 years ago

I wonder if, after we've compiled one or more lists of commands to add, we could somehow calculate the completeness percentages automatically and display them in the README with a badge.

If we do compile multiple lists, we could even organize the completion badges in a table to provide a dashboard similar to the progress table of Wikipedia's WikiProject Missing encyclopedic articles.

Does anyone have an idea whether something like that is doable and/or hints about how to go about implementing it?

agnivade commented 7 years ago

I would like to take a stab at this. I am thinking of just taking the GNU coreutils list and test parity against it. The linux.die.net page contains a lot commands which have to be installed separately.

The badge thing can be easily done with a custom svg element.

waldyrious commented 7 years ago

@agnivade I can't wait to see what you come up with! I'm more than willing to provide the actual content of the lists if that takes some work off your plate (I have a bunch of notes and links in a google doc, besides the resources I listed above).

agnivade commented 7 years ago

Sure, that would be great.

sbrl commented 7 years ago

Oooh, awesome :D

agnivade commented 7 years ago

@waldyrious - I might take a stab at it this weekend. Can you share the links/notes that you have ?

waldyrious commented 7 years ago

Sure. I'll block off one hour to work on this today, and will post the resulting data.

waldyrious commented 7 years ago

Heads-up: the wiki page "Pages plan" has been deleted to centralize tracking of missing pages in this thread. I've moved all the information that was present there to this spreadsheet, which is publicly viewable and anyone can add comments. It's a work in progress (I just started it). I'll give write access to the current maintainers.

sbrl commented 7 years ago

@waldyrious Wow, that's an impressive spreadsheet! Is there a filter for just the ones that haven't been done yet? How are bulk lists of commands added to the list?

waldyrious commented 7 years ago

There will be a filter, yeah -- that's one of the reasons I've decided to build it in a spreadsheet. The lists of commands will be added manually (using various helper tools, of course), since the various sources don't use a common format. Let me know on Gitter if you'd like to work on this so we can coordinate.

agnivade commented 7 years ago

I am concerned about how do I get the total list of commands programmatically. Since I would like to run the list against every commit merged with master.

waldyrious commented 7 years ago

That document is by no means meant to be the final location of the list. It's just the way I figured would be easiest to get it started and quickly filling it. I don't know yet what setup would be the best balance of (1) community maintenance of the data, (2) machine consumption of the contents, (3) automatic synchronization (as much as possible) as new pages are added. Ideas are welcome.

Also, the choice of how to set this up would depend on how often we would want to update the list. I think we can start with something reasonably static, to make things easier, especially since we have a lot of work to catch up to established commands before it would make sense to start chasing more dynamic lists (say, top node.js-based CLI tools or something like that)

agnivade commented 7 years ago

(3) automatic synchronization (as much as possible) as new pages are added.

Umm no .. I think you got the wrong idea. :stuck_out_tongue_closed_eyes: We don't need to synchronize when new pages are added. That would be crazy. It seems like you put a lot of effort into this. Frankly, I didn't need so much details.

Here's what my plan is -

On every commit to master branch, run a script which will get all the commands in the repo, get a list of target commands we want to match our list with, calculate percentage and update the svg badge which shows the percent completion.

That's it. No need to update any list when new pages are added.

waldyrious commented 7 years ago

Hahah, yeah, I got a little carried away there. Although I might have given off the wrong impression.

The way I was planning to have this "automatic sync" feature was to simply open one issue per command to add, and assign them to milestones according to the lists they appear in. That way we'd get a nice live overview page with progress bars for each of the lists we'd want to reach parity with. For reference, my inspiration came from the overview table of Wikipedia's WikiProject Missing encyclopedic articles.

In addition one milestone per (major) source, we might also want platform-specific lists (windows commands, bsd, etc.), and maybe topic-specific lists (email clients, text editors, compilers, etc.).

Of course, this doesn't prevent us from having a "master completeness list" and use that to compute a single "overall completeness" metric. We do need to decide what goes into that list, though. The obvious choice is a metric of the most popular pages (e.g. the top 1000 entries sorted by how many of those lists they appear on), but let me know if you think something else would make more sense.

agnivade commented 7 years ago

Your idea seems like a lot of manual work, something which I personally would want to avoid. I was planning for that "completeness" metric and be done with it. If we indeed decide that it's just gonna be 1000 commands, then we might as well compute the list and check it in in the repo, so that my code can easily compare with it.

waldyrious commented 7 years ago

Sure, as I said, the list will probably not change much after we compile it. My idea is just a nice-to-have I might do on my own later on (unless you guys object).

For the master list, we just need to decide what is the criteria we'll use to define its contents -- from there it's just a matter of collecting the rest of the data and applying the filters.

So in that regard, what are your thoughts regarding which criteria to use: which lists to compare against, how many commands to include, etc?

waldyrious commented 7 years ago

Update: the table is pretty much ready now. Some areas that still need some help:

cells painted yellow indicate sources that haven't been integrated yet. There are four of them. Any help in that regard would be appreciated (just parsing those sources into a plaintext list of commands would suffice)
cells painted orange indicate expected counts (number of "x" marks in that column) that don't match the automated count. I'm not sure what's going on there, so it would be nice if someone with fresh eyes could double-check those columns.

Apart from that, we can start deciding how to use that data to compile our master list :)

Note: I didn't include the linux.die.net manpages, since even just the first section contains about 10,000 commands, which makes the table unwieldy and kinda overwhelming, to be honest.

waldyrious commented 7 years ago

By the way, the plan to use milestones won't be possible after all. I had already reached this conclusion before, but forgot it in the meantime: it turns out GitHub only allows a single milestone per issue, so there would be no way to simultaneously track progress towards multiple coverage parity goals :(

That said, we could still have a milestone for the master parity list, which IMO would be a good thing as it would make those missing commands more visible as issues that newcomers could tackle. (It could also be the target URL for the badge.)

agnivade commented 7 years ago

cells painted orange indicate expected counts (number of "x" marks in that column) that don't match the automated count.

So are you saying you have manually counted each 'x' just to verify the automated count ? That's some dedication ! Why bother with the manual count at all if there is already automation for it ? Unless you suspect that =COUNTIF(L5:L,"x") is wrong ?

waldyrious commented 7 years ago

So are you saying you have manually counted each 'x' just to verify the automated count ? That's some dedication !

Oh god no, haha :P I'm not that crazy ;) I had the correct count from the actual lists that can be seen in the "lists" sheet (number of lines, basically, which any decent text editor will provide), which gives me more confidence in the result than using the formulas. Besides, some of the automated counts indeed are correct. I found some duplicated entries before, due to imperfect filtering, and that fixed some of the mismatched counts -- but I can't figure out what's causing the remaining mismatches...

agnivade commented 7 years ago

Ah I see :) Didn't notice that there was another sheet.

sbrl commented 7 years ago

Awesome work! Yeah, perhaps we could have a 'current goal' to document all the commands in a given list, and keep moving to new lists as we complete old ones. Having a list of commands auto-generated that have yet to be documented for the 'current goal' parity list would be helpful for newcomers, yeah.

The sheet is rather unwieldy though on my screen, since the frozen panes take up about 60% of my available screen real-estate on my laptop :confused:

agnivade commented 7 years ago

I think we should move the orange and yellow cells to a new row below. Because its in the same row as coverage. And it just signifies the expected count, not coverage.

And lastly, our current coverage % is 52 right ?

waldyrious commented 7 years ago

The sheet is rather unwieldy though on my screen, since the frozen panes take up about 60% of my available screen real-estate on my laptop :confused:

I made the heading more compact. Is that workable now?

I think we should move the orange and yellow cells to a new row below. Because its in the same row as coverage. And it just signifies the expected count, not coverage.

Agreed, I just did that. Ideally we won't even have to include the expected count on the table, but until we figure out what's going on with the mismatched values, we'll need those cells.

waldyrious commented 7 years ago

Ok, I've filled the table some more. Also, apparently I can't reproduce the count mismatch anymore, so ¯\_(ツ)_/¯

The two sources that still need parsing into a plain list of command names are Inconsolation and ArchWiki's List of applications. Any help appreciated!

And lastly, our current coverage % is 52 right ?

Yes, but that's a plain fraction that doesn't consider the relative importance of the missing commands. I'd rather have a weighted coverage percentage, where each entry is weighted by the number of occurrences in these other lists. ~~I'll have that working in a bit.~~ Edit: done (see top right corner of the table). Looks like at this point it isn't much different from the plain percentage, though :stuck_out_tongue_closed_eyes:

sbrl commented 7 years ago

Through weird es6 magic, I bring you a list of commands for the Inconsolation lists! Here's the code I used in the firefox console for reference:

(function() {
let result = [];
document.querySelectorAll(".entry-content > p:nth-child(4) a[href]").forEach((el) => {
    if(el.innerText.search(":") === -1 || el.innerText.trim()[0] !== el.innerText.trim()[0].toLowerCase() || el.innerText.search(/\./) !== -1) return;
    result.push(...el.innerText.split(":")[0].split(/\s*(and|,)\s*/gi));
});
result = result.filter((cmd) => cmd.search(/[,\*\(\) \{\}]|and/) == -1 || cmd.length == 0);
console.log(result.filter((el, i, arr) => arr.indexOf(el) === i).join("\n"));
})();

...I've pasted them into the spreadsheet. They might need a little bit of tidy-up work though, since the input was messy.

That archwiki one though looks tough, since they don't detail the name of all commands in the list.

waldyrious commented 7 years ago

Can you explain the code? I'm afraid just parsing the link titles will produce a list with way too many missing entries, because many of the titles don't contain command names directly. On the other hand, I'm not sure I can think of anything that would work better without involving manual processing of each page linked from the entries... :confused:

As for the ArchWiki page, I guess it would suffice to extract only the contents of the sections titled "Console". That will definitely leave some gaps in the output, but the page isn't meant to be a structured list anyway, nor it focuses specifically on command line programs, so I guess it's reasonable to parse it more loosely.

sbrl commented 7 years ago

It is a bit messy, isn't it! :stuck_out_tongue: What it does is extract the names of the commands listed on the page, since I assumed that it was an index of all the commands the author had talked about. It discards the following:

Items in the list with a capital letter at the beginning
Items without a colon

Once done, it extracts the bit before the colon and does the following:

Splits it on , and and
Discards any parts containing and, ,, (, ), {, },
Discards any parts that have a length of zero.

gingerbeardman commented 6 years ago

Some related discussion here: https://github.com/tldr-pages/tldr/issues/1953

agnivade commented 6 years ago

This has been pending too long ! I will go on vacation soon, I promise to work on this during that time !

gingerbeardman commented 6 years ago

Enjoy your vacation! If the two coincide, then so be it 👍

tldr-pages / tldr

Create lists of commands to test coverage parity against #1070