github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License

Language Stats skewed by not showing reStructuredText, SVG, etc. (but for instance TeX). #3228

Closed by inoas 7 years ago

inoas commented 7 years ago

Could it be that GitHub detects Sphinx docs (reStructuredText) as JavaScript?

Ref: https://github.com/isaacs/github/issues/768#issuecomment-247761390

If that's the case, then it probably counts all reStructuredText (and thus Sphinx docs) wrongly towards JavaScript and, in case you gather JavaScript stats with this code, skews the global stats as well.

pchaigno commented 7 years ago

Except for the jquery.js file, these files account for the 44.6% of JavaScript (we count the number of lines, not the number of files). You can mark those files as vendored using Linguist overrides.

inoas commented 7 years ago

So reStructuredText is not accounted for? I.e. all of the docs source files are being ignored?

Alhadis commented 7 years ago

Not ignored. reStructuredText is excluded from the statistics bar because it's a prose language. I'll copy you an explanation I wrote last month:

There are four distinct classifications of language on GitHub:

  • Prose: Material intended to be consumed by a human reader, with little to no processing
  • Markup: Material intended to be presented to a reader after being processed or parsed by a program
  • Programming: Material with instructions designed to be interpreted or compiled by software
  • Data: Virtually anything that doesn't fit into the above

Language statistics are only affected by markup and programming languages. Data and prose are ignored, as they'd unfairly bloat a repository's language statistics.

Markdown is classed as "prose", as it's designed to be read in its raw form as easily as its processed form. HTML is classed as "markup", because it's not designed to be read in its raw form.
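As a rough illustration of the rule described above, here is a minimal Python sketch that keeps only "programming" and "markup" languages when computing a breakdown. It is not Linguist's implementation: the `LANGUAGE_TYPES` table, the `language_stats` function, and the sample sizes are all hypothetical and exist only for this example.

```python
from collections import defaultdict

# Hypothetical, hand-written subset of classifications based on the
# explanation above; Linguist's real language table is far larger.
LANGUAGE_TYPES = {
    "JavaScript": "programming",
    "HTML": "markup",
    "Markdown": "prose",
    "reStructuredText": "prose",
    "JSON": "data",
}

# Only these two classifications affect the statistics bar.
COUNTED_TYPES = {"programming", "markup"}


def language_stats(files):
    """Return {language: percentage} for languages whose type is counted.

    `files` is an iterable of (language, size) pairs, where `size` is
    whatever unit the caller measures.
    """
    totals = defaultdict(int)
    for language, size in files:
        if LANGUAGE_TYPES.get(language) in COUNTED_TYPES:
            totals[language] += size
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {lang: round(100 * size / grand_total, 1) for lang, size in totals.items()}


# Made-up repository: mostly documentation, a little code.
repo = [("reStructuredText", 9000), ("JavaScript", 300), ("HTML", 700)]
stats = language_stats(repo)
```

With this made-up input the reStructuredText files do not appear in `stats` at all, which mirrors the situation the issue describes: a documentation-heavy repository can end up reported as JavaScript and HTML only.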

inoas commented 7 years ago

That is a really bad decision. In particular, the difference between reStructuredText and simple HTML documentation (markup) is marginal at best. It really devalues community-driven documentation and the tools/languages it uses.

Alhadis commented 7 years ago

It really devalues community-driven documentation and the tools/languages it uses.

That's a really silly thing to say, especially since GitHub has gone out of their way to support the displaying of rendered reStructuredText in repositories.

inoas commented 7 years ago

There is nothing silly about it. Either the stats are relevant or they are not. Disregarding reStructuredText is a slap in the face towards the docs teams, stuff where open source can shine, but doesn't always (due to lack of documentation).

Alhadis commented 7 years ago

It's not disregarding anything. These files are still calculated in repository breakdown graphs. But seriously, be realistic for a second: if every repository's stats were skewed by things like documentation, can you imagine how cluttered the statbar would become?

Sure, the end result looks strange if the repository is documentation-specific. But for the vast majority of codebases, documentation is ancillary.

Alhadis commented 7 years ago

I think you're seriously overreacting here, dude.

inoas commented 7 years ago

I'd love to keep it professional.

Realistically: the result of the stats is misleading and wrong. Either the stats matter, reflect what's going on in the repo, and thus deserve their space in the UI, or they are bad and, realistically, can be removed.

Realistically, disregarding big chunks of contributions in FLOSS is skewing the stats.

But for the vast majority of codebases, documentation is ancillary.

So you are saying open source projects should host anything not purely code at some other place? Is that an official statement?

Suggestion

Instead of listing only what the maintainers of Linguist/GitHub deem relevant, why not drop the individual stats for anything below 3% or 5% and list those under "other"? At least when more than one language falls below the threshold (for instance, this repo shows <1% for Shell, which could stay as it is because it is the only such category).
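The suggestion above could be sketched roughly as follows in Python. This is only an illustration of the proposal, not anything Linguist does; the function name `group_small_shares`, the 5% default, and the "Other" label are assumptions made for the example.

```python
def group_small_shares(stats, threshold=5.0):
    """Collapse languages below `threshold` percent into a single "Other" entry.

    Grouping only happens when at least two languages fall below the
    threshold; a single small language keeps its own entry, as proposed.
    """
    small = {lang: pct for lang, pct in stats.items() if pct < threshold}
    if len(small) < 2:
        return dict(stats)
    grouped = {lang: pct for lang, pct in stats.items() if pct >= threshold}
    grouped["Other"] = round(sum(small.values()), 1)
    return grouped


# Made-up percentages for illustration.
many_small = group_small_shares({"Ruby": 95.0, "Shell": 0.8, "C": 4.2})
one_small = group_small_shares({"Ruby": 99.2, "Shell": 0.8})
```

Here `many_small` merges Shell and C into one "Other" entry, while `one_small` is returned unchanged because only one language is below the threshold.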

Alhadis commented 7 years ago

I'd like to point out I'm not GitHub staff, just a frequent contributor.

So you are saying open source projects should host anything not purely code at some other place?

GitHub's primary purpose is for hosting, sharing, and improving code. Not documentation.

Why the hell has this gotten you so worked up?

inoas commented 7 years ago

It gives a false perception of some repos. Docs matter. Maybe not to you, but just take a quick glimpse at Stack Overflow and (even if you take out the people just too lazy to RTFM) documentation is a core issue.

Then if you go to choosealicense.com (also featured officially by GitHub), they offer licenses for non-code as well. So before taking your word, I want to see some GitHub official coming along saying "GitHub is for sharing and improving code and only code, the rest is accessory".

Until then, I think the current way Linguist discards relevant stats is totally misleading and disregards the important work of people working on docs (the guys paving the way for mainstream FLOSS).

Alhadis commented 7 years ago

I think you've completely misinterpreted the purpose of the stats-bar...

It's not to provide an exhaustive indication of everything a repository contains. It's to give a quick overview of what sort of code has been written for a codebase. It rules out things like data, vendored/boilerplate files, and generated blobs, all of which are still included in the stats breakdown if you click on the stat-bar to see exactly what the repository contains.

inoas commented 7 years ago

To give another point of view. Is SVG code? Or is SVG art? SVG can be written by hand, animated through SMIL, CSS-Animations/Transitions and JS. Just because it is declarative it is also dropped from the stats?

Ref / Example: https://github.com/danleech/simple-icons

Alhadis commented 7 years ago

SVG is XML, and XML is chiefly used for serialising data.

inoas commented 7 years ago

I think you do not understand the impact of the stats bar. Either it is irrelevant (as I said before, and can simply be removed) or it gives users a glimpse of what's in the repo. Either GitHub wants to be a platform only for "executable code" (and here the fun would start, arguing whether XSLT is executable the way JavaScript is), in which case people should find a different collaborative solution elsewhere, or it should reflect current stats, at least for recent technology.

And it is not like reStructuredText or SVG are declining.

So SVG is serialized data? http://tutorials.jenkov.com/svg/svg-animation.html - moving serialised data then?

HTML is XML too (at least the good one, where you can mix and match SVG, MathML etc), so why show that in the first place in the stats bar?

inoas commented 7 years ago

A quick glance at a repo should tell me: "oh yeah, it's got a lot of docs in here", either bundled with the source code or standalone (or: oh no, this doesn't have a lot of written docs, but there is the link to the docs repo).

inoas commented 7 years ago

And could you explain how TeX is a "programming language" whereas SVG or reStructuredText are not? https://github.com/mhyee/latex-examples

Alhadis commented 7 years ago

Inoas, have a look at one of my repositories. I authored extensive documentation for a JavaScript module, and there's no way I'd want my repository being listed as "Markdown" because the extent of documentation outweighed lines of executable code.

I imagine many other programmers would feel the same.

Alhadis commented 7 years ago

Also, HTML isn't XML. It evolved from SGML, and allows certain syntax that would obviously be invalid in any XML-based language:

<ul>
    <li>Item 1
    <li>Item 2
    <li>Item 3
</ul>

Do you actually know how to write LaTeX?

here the fun would start arguing that XSLT is executable like javascript is, or is it not?

... XSLT is a programming language, and it's also categorised by the site like one. Not sure where you're coming from.

inoas commented 7 years ago

Simple thing:

I imagine you, as a programmer, can follow the logic of what can be done about this issue so that we reach an almost Pareto-efficient situation.

inoas commented 7 years ago

Aside: for the sake of this argument, it does not matter that HTML evolved from SGML. XML is not represented because you say it is "just serialised data", but HTML is shown, exactly why? It is nice for lazy authors to write HTML5 the SGML way, but you will lose all the flexibility of mixing different XML types and being able to use XML transformation and validation tools.

I have no problem with TeX/LaTeX appearing as part of the stats. All I am saying is that it is inconsistent and misleading, and maybe the reason for it is simply:

I imagine many other programmers would feel the same.

But maybe, just maybe, you can realise that either GitHub is a platform for the whole of open source projects, or you just consider it a hackers' sharing and improving tool. I can see where you are coming from if that's the case. I just don't agree (and your whole line of argument, starting with calling others' statements silly without explaining, isn't nice either).

If reStructuredText, SVG, etc. were shown, all you would have to do is add a simple override if you considered those files not worthy of the stats. You would be in control then. Right now, repo owners cannot decide whether they deem any kind of XML dialect, reStructuredText, or whatever else gets filtered out to be relevant. The choice has been made for them, and it is quite elitist.

inoas commented 7 years ago

Anyway - enough wasted energy from both sides I bet, I think I made my point. I'd love to see relevant "source code" (even if it is not ASM) to be shown in the stats in future.

Alhadis commented 7 years ago

... there is no issue.

Your only issue is that some files aren't included in a repository's language summary, and you interpret this omission as some grave offence against the open source community.

Reading this thread has been a drug trip.

inoas commented 7 years ago

Please try to be professional. I am not claiming you are too lazy to add a Linguist override either; I just assume you have good reasons for your arguments.

It is also not an offence against the open source community as such, but against all those who make sure it thrives. I will regard your comments as biased and personal from now on.

P.S.: I don't do any kind of drugs, for that matter, as it seems you are curious to know.

arfon commented 7 years ago

Hey @inoas, thanks for opening this issue ❤️. I'm staff at GitHub and spend a fair amount of my time on Linguist.

Firstly, sorry for being slow to respond to this issue, it's been a busy week.

Secondly, I wanted to offer a little background for some of the behaviour of Linguist with respect to the language bar. My understanding is that its primary purpose is to help visitors understand what (programming) languages are being used by a project and decide whether they may want to use/contribute to it.

Language detection (via Linguist) serves (at least) three main purposes on GitHub:

I think we're mostly discussing the second of these use cases here (repository identity). For me personally, the primary way in which I judge my initial interest in a project/repo is something like:

  1. The programming language the library/repo is written in
  2. The quality of the README/docs/API
  3. Whether it looks well maintained/has tests etc.

While I absolutely agree that documentation is a critical part of open source contribution, I personally wouldn't be particularly interested in understanding whether the repository used reStructuredText or Markdown for documentation. As I say though, that's just my personal preference but it was clearly also the preference of my predecessors on this project too.

There's some discussion on this thread about HTML and other markup languages that are included in repository stats. If you're willing, I'd encourage you to read the pull request that added reporting of these to the language bar.

As you'll see in this thread, there was concern that by including prose in repository statistics, very documentation-heavy repositories such as https://github.com/rails/rails could have their language statistics skewed very heavily towards Markdown (rather than Ruby) which we felt was somewhat misleading.

Anyway, I just wanted to give you this context to help you understand why the current decisions have been made. Truth be told I'm not entirely happy with the way that GitHub reports language breakdowns for a repository but I'm also not convinced that simply including prose in the language stats for a repository is the right fix.

inoas commented 7 years ago

Some questions ahead:

  1. The programming language the library/repo is written in
  2. The quality of the README/docs/API
  3. Whether it looks well maintained/has tests etc.

That's all fine. So stats on docs will tell me about 2. and 3. Stats on docs could also include docblocks, but that's another issue.

With the current setup I don't know whether a repo cares about documentation, nor, if there is a separate documentation repo, can I quickly grasp what technologies (programming languages, prose) are being used. For cakephp/docs it looks like we write our docs in JavaScript, thanks to the bias of Linguist.

Last but not least, it is not only docs that make open source work but also asset contributions. Those things can be quite technical and mathy (SVG is similar to TeX), so why are vector files, especially if they are readable and human-maintainable/creatable code, excluded?

Finally, why is it so important to see that a project uses 2% of this "programming language" and 1% of "that" one, while you cannot see at a glance that it uses 20% Markdown? How about putting multiple candidates below 5% together in one accumulated "other" stat?

If you don't want to fix this at the statistics level, which I'd personally love to see, then I'd assume it would be nice to have different kinds of repository categories (mark 1 to n on creation), mainly: Code, Docs, Assets, Data. That should make it easier to set up a "bias" that fits the respective repos better.

arfon commented 7 years ago

Would you care to elaborate why HTML and TeX are added but reStructuredText, Markdown are not?

In case it's not clear, Linguist has evolved over time and there may well be some inconsistencies in some of the classifications. That said, I believe HTML is included as it's considered markup, and this includes all of the view template languages such as ERB.

TeX is a tricky one as I believe it's Turing complete so could in theory be defined as programming rather than the current markup. In my experience of seeing projects on GitHub that have LaTeX files in, they're often academic papers and so excluding TeX from the classifications of these projects would be less than ideal.

Isn't it already easily possible to remove certain "file types" from the statistics bar, given how Linguist works, so that it wouldn't be a problem to opt in to ignoring Markdown?

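It is. But it's not possible to override the inclusion of particular files/markups, i.e. it's currently possible to do this:

  • Ignore all files in a particular folder/path

but it's not possible to do something like:

  • Choose to include files that are by default excluded (such as prose files)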

Better support for overriding Linguist's classification behaviour is something I would definitely like to see us do.

thus wouldn't be a problem to opt-in to ignoring markdown?

Because of the reasons I gave in my earlier post I believe this wouldn't be the correct solution for most people. Instead, I'd rather have us allow people to opt-in to include Markdown.

Last but not least, it is not only docs that make open source work but also asset contributions. Those things can be quite technical and mathy (SVG is similar to TeX), so why are vector files, especially if they are readable and human-maintainable/creatable code, excluded?

This is a different question. We've discussed this on a number of occasions in other issues on this repository.

It seems like the default behaviour of Linguist on GitHub is not to your taste. I totally understand this but the GitHub community is very large and diverse and any change we make that would be more to your taste/preference would likely be a bad change for someone else. This is why I believe most of these issues are best addressed by support for full overrides of Linguist's defaults within the scope of your repository.

Finally, I should point out that this repository isn't monitored by GitHub's support team so if you'd like to express your desire for better Linguist overrides please contact support@github.com

inoas commented 7 years ago

TeX is a tricky one as I believe it's Turing complete so could in theory be defined as programming rather than the current markup. In my experience of seeing projects on GitHub that have LaTeX files in, they're often academic papers and so excluding TeX from the classifications of these projects would be less than ideal.

I totally agree. Exactly this argument favours having Markdown-only or reStructuredText-only repos reflected in the stats. So you are supporting my argument after all.

I'd also personally consider code repos to be of better quality if a glance at the stats showed that they have a lot of documentation (and/or unit tests etc.) rather than "bare code". So the GitHub feature and Linguist could be of help here.

I also do not see why a code-sharing cloud platform should value academic papers over open-source docs. Both are equally important.

Isn't it already easily possible to remove certain "file types" from the statistics bar, given how Linguist works, so that it wouldn't be a problem to opt in to ignoring Markdown?

It is. But it's not possible to override the inclusion of particular files/markups, i.e. it's currently possible to do this:

  • Ignore all files in a particular folder/path

Exactly! So it would not hurt to add support for what you call prose. (I am okay with calling it prose; however, the differences between TeX, HTML/CSS, XML/XSLT, Markdown and reStructuredText are not as distinct or pronounced as @Alhadis tries to paint them. They rather sit at different places on one dimension/spectrum.)

but it's not possible to do something like:

  • Choose to include files that are by default excluded (such as prose files).

Better support for overriding Linguist's classification behaviour is something I would definitely like to see us do.

I'd love to see this happen. If you consider GitHub not a tool for entire open source projects but mostly/primarily for their Turing-complete programming language sources, then it would be great to introduce repository kinds/categories which switch some things around, as I suggested above: Code, Tests (say Cucumber), Documentation/Papers/Specs, Assets, Data come to mind. But you will have better in-house stats on that anyway.

This is not so much about my personal preference. If the staff at GitHub get in touch with the docs teams of FLOSS projects and ask what they would want to see in Linguist statistics, then you will have a (partially) objective foundation to argue from.

Thanks for taking care!

TomasHubelbauer commented 4 years ago

Is this decision to not include prose in the stats still the same three years later? I have just run into this issue and it is really disappointing to see that that can't be overridden. I completely agree that this should not be a default, but I don't understand why explicitly marking stuff as not documentation, like so:

*.md linguist-documentation=false

is not enough to make Linguist count these files. Or even, if this is still deemed too risky (say you have code files in docs/ and want to see those but not Markdown files in docs/), why not introduce:

*.md linguist-prose=false

Or

*.md linguist-classification=programming

It seems odd to me that this is just flat out not possible. I don't mind it not being easy, but I would have hoped it would be made possible, because sometimes it just makes sense. Sure, GitHub is virtually all code repositories, but not completely: there are art and writing repositories whose authors would like to see the language bar breakdown.

TomasHubelbauer commented 4 years ago

Haha of course just as I finish typing that I find that this indeed has been made possible!

*.md linguist-detectable=true
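To make the effect of that override a little more concrete, here is a deliberately simplified Python model of the decision it influences. It is a sketch based on this thread's description, not Linguist's actual code: the function name `counts_toward_stats` and its parameters are hypothetical, and the model ignores the separate vendored, generated, and documentation overrides.

```python
def counts_toward_stats(language_type, detectable=None):
    """Decide whether a file contributes to the language bar.

    `language_type` is one of "programming", "markup", "prose", "data".
    `detectable` models an explicit per-path override: True or False when
    the repository sets one, None when it does not.
    """
    if detectable is not None:
        # An explicit override wins over the default classification.
        return detectable
    return language_type in {"programming", "markup"}


default_markdown = counts_toward_stats("prose")
opted_in_markdown = counts_toward_stats("prose", detectable=True)
opted_out_html = counts_toward_stats("markup", detectable=False)
```

Under this model Markdown is left out by default and counted once the repository opts in, which is the opt-in behaviour discussed earlier in the thread.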