github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License

Linguist is reporting my project as a Jupyter Notebook #3316

Closed adam704a closed 6 years ago

adam704a commented 7 years ago

As you can see, I have some notebooks, but mostly this is a python project.

https://github.com/ICTatRTI/researchnet

Did I do something wrong?

TotalVerb commented 7 years ago

Jupyter notebooks have an inflated number of lines of code, since they store a lot of metadata. So it doesn't take many notebooks to "take over" a project.

Alhadis commented 7 years ago

Does anybody actually write these files out by hand? Because it sounds like they're generated primarily from a webapp:

The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

And if that's the case, well, I'd say these generated files should be marked as exactly that: generated.

/cc @pchaigno, resident Python guy

TotalVerb commented 7 years ago

Whatever action is taken, it would be best to maintain the searchability and identifiability of notebook-only repositories.

TotalVerb commented 7 years ago

Possibly the best course of action is to turn the reported lines of code into an "equivalent lines of code" measure that takes the unavoidable boilerplate into account. For instance, a source line consisting of the single character π may turn into this monstrosity in the .ipynb file:

  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "π = 3.1415926535897..."
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "π"
   ]
  },

Alhadis commented 7 years ago

That thing is about as long as the value's floating-point component itself.

All we'd need to mark these things as generated is to match against a common pattern that's consistently used in webapp-created notebooks. Usually it's something like Generated by AppName 1.1.1.1.1.1-betasemverasfuck0 or what-have-you.
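
A minimal sketch of that approach, in Python for illustration (Linguist itself is written in Ruby); the banner pattern is hypothetical, since webapp-created notebooks don't actually embed a consistent "Generated by" marker:

import re

# Hypothetical banner; shown only to illustrate the approach.
GENERATED_MARKER = re.compile(r'"Generated by [\w.-]+ [\d.][^"]*"')

def looks_generated(path, max_bytes=4096):
    # Cheap single-pass check: scan only the head of the file, the
    # way Linguist's generated-file heuristics avoid full parses.
    with open(path, encoding="utf-8", errors="replace") as f:
        head = f.read(max_bytes)
    return bool(GENERATED_MARKER.search(head))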

TotalVerb commented 7 years ago

You could maybe match against

 "metadata": {
   // [ stuff in here varies ]
 },
 "nbformat": 4,
 "nbformat_minor": 1

but wouldn't this make notebook-only repositories classify incorrectly?

Alhadis commented 7 years ago

Marking them as generated simply omits them from the language-statistics bar. We already have a number of generated-file detection routines that filter files that would otherwise unfairly skew a repository's stats. Here's the logic for detecting generated PostScript, for example. You can imagine how many projects would be incorrectly classified as PostScript if we left every .eps file unchecked.

And while that snippet you've posted might work, it should ideally be 100% unambiguous, i.e., leave no room for misidentification. The existing rules that test against single-line patterns are all very specific.
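
For contrast, an exact single-line test of the kind described here might look like this in Python (the marker strings below are common real-world examples, not Linguist's actual rule set):

# Exact, tool-specific markers leave no room for misidentification,
# unlike a loose regex over arbitrary JSON metadata.
UNAMBIGUOUS_MARKERS = (
    "# Generated by Django",   # Django migration header
    "// <auto-generated>",     # .NET auto-generated header
)

def unambiguously_generated(first_line):
    return first_line.lstrip().startswith(UNAMBIGUOUS_MARKERS)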

TotalVerb commented 7 years ago

The difference between PostScript and Jupyter is that all Jupyter notebooks are "generated", though (either by the web app or by IPython's CLI). And unlike PostScript, human effort generally needs to go into every cell of a Jupyter notebook; it's just that each cell ends up taking a lot of lines of code.

Here are some empty, newly-created notebooks with Julia and Python kernels.

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Julia 0.5.0",
   "language": "julia",
   "name": "julia-0.5"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "0.5.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}

and

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [conda root]",
   "language": "python",
   "name": "conda-root-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}

TotalVerb commented 7 years ago

For what it's worth, I personally think that a good solution would be to estimate how many lines of a Jupyter notebook are "source" and how many are "generated". Source lines (which are all written by a human) generally look like this:

   "source": [
    "import Base: +\n",
    "\n",
    "+{T<:Number}(x::DualNumber{T}, y::DualNumber{T}) = DualNumber{T}(x.re + y.re, x.ep + y.ep)\n",
    "\n",
    "DualNumber(10.0, 17.0) + DualNumber(5.0, 9.0)"
   ]

Can linguist already handle partial file identifications?
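
A rough sketch of that estimate, assuming the stock nbformat-4 layout shown in the snippets above:

import json

def notebook_line_stats(path):
    # Count the lines a human actually wrote (the "source" arrays)
    # and compare them with the physical line count that a naive
    # lines-of-code measure would see.
    with open(path, encoding="utf-8") as f:
        raw = f.read()
    nb = json.loads(raw)
    source_lines = sum(len(cell.get("source", []))
                       for cell in nb.get("cells", []))
    total_lines = raw.count("\n") + 1
    return source_lines, total_lines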

soniclavier commented 7 years ago

I am having the same problem: I have about 3-4 .ipynb files out of 144 files (mainly Java and Scala) in my repo. If there were an option to make Linguist report based on the count of files rather than their size, it would be helpful.

For now, I added *.ipynb linguist-vendored to the .gitattributes file in my repository.
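
For reference, Linguist documents a few .gitattributes overrides; which one fits depends on whether you want notebooks hidden from the stats or counted as another language (the Python override is just an example):

# Exclude notebooks from the language statistics entirely:
*.ipynb linguist-vendored

# Or mark them as documentation instead:
*.ipynb linguist-documentation

# Or keep them in the stats, but counted as Python:
*.ipynb linguist-language=Python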

caged commented 7 years ago

:wave: It looks like the original repo is no longer classified as an IPython notebook, and I don't see a .gitattributes file in the repo. Can someone clarify whether this is still an issue?

lildude commented 6 years ago

As @Caged mentioned, things appear to be working now on the original repo. As there hasn't been an update since 3 May, I'm closing this on the basis this has been resolved.

pierluigiferrari commented 6 years ago

@lildude, @Caged I can confirm that things are still not working for Jupyter notebooks. It's the same issue as before: a Jupyter notebook consists of Python code that the author wrote, and of generated code that makes it an interactive environment displayable in a web browser. The generated code usually makes up far more lines than the Python code the author wrote.

The first problem here is that for the purpose of what linguist is trying to achieve (i.e. a breakdown of the programming languages the author used in the repo) "Jupyter notebook" should not be considered a language at all. For all intents and purposes it's just a container that holds Python code.

The second problem is that simply ignoring Jupyter notebooks in the statistics also ignores all of the actually relevant Python code inside them.

lildude commented 6 years ago

Thanks for confirming this and for the explanation, @pierluigiferrari. Having looked into it, I now have a better understanding, and given your two points, I don't think this is something that can easily, if ever, be addressed automatically.

The biggest limiting factor that I can see is the fact that Jupyter notebooks combine written and generated language within the same file. Linguist doesn't support partial file classification and isn't likely to ever do so, as I'd imagine it would be incredibly resource-intensive and probably highly unreliable when it comes to even attempting to differentiate between human- and computer-written code within the same file. Our current classifier is already hugely inefficient as it is.

The next limiting factor is preference. Some want Jupyter notebooks recognised for what they are, others prefer them to be identified by the language they're written in, and others still don't want the files counted at all.

I think our current implementation (implemented in https://github.com/github/linguist/pull/2746 via https://github.com/github/linguist/pull/2763) combined with manual overrides is probably the best compromise for all.

Jupyter notebooks are also far too prevalent on GitHub to change the default behaviour without major backlash.

Alhadis commented 6 years ago

Linguist doesn't support partial file classification and isn't likely to ever do so, as I'd imagine it would be incredibly resource-intensive and probably highly unreliable when it comes to even attempting to differentiate between human- and computer-written code within the same file.

... which is where an idea of mine may hold the answer. ;) I've regurgitated some sleep-deprived explanations of how weighted averages assigned to specific scopes could yield a more rational Python Notebook usage figure, i.e., the number of lines the programmer actually penned by their own hand.

pierluigiferrari commented 6 years ago

@lildude I understand. As you said, it seems like the best solution for Jupyter notebook users is to use manual override. Thanks for clarifying why it is the way it is and why it will likely remain this way!

Borda commented 5 years ago

@lildude I understand. As you said, it seems like the best solution for Jupyter notebook users is to use manual override. Thanks for clarifying why it is the way it is and why it will likely remain this way!

What does the manual override for the language statistics on GitHub mean? Is it the .gitattributes file? In my opinion, it would be fairer if, for .ipynb files, only the source lines were counted, not all the metadata as well as all the generated outputs...

pchaigno commented 5 years ago

@Borda Please see the last paragraph of how Linguist works and Linguist overrides.

Darel13712 commented 2 years ago

@lildude The current way of counting Jupyter notebooks as whole files instead of just their code is very confusing. This way you may have a single notebook that contains more lines than your whole Python library.

Can you elaborate on why Linguist can't count only the human-written lines in Jupyter notebooks? As I understand it, we need to take the source property of cells that have cell_type set to code.

Also, there is jupyter nbconvert --to script notebook.ipynb to get only the code. The language type can still be "Jupyter Notebook", but counting only the code lines, without the output, is critical in my opinion.
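
A minimal sketch of that extraction, assuming the nbformat-4 layout quoted earlier in the thread; it keeps roughly what nbconvert would emit as the script body:

import json

def extract_code(path):
    # Keep only the "source" of cells whose "cell_type" is "code".
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)
    return "\n\n".join("".join(cell.get("source", []))
                       for cell in nb.get("cells", [])
                       if cell.get("cell_type") == "code")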

Alhadis commented 2 years ago

Can you elaborate on why Linguist can't count only the human-written lines in Jupyter notebooks?

Linguist either considers a file's contents for analysis, or it doesn't. It doesn't support piecemeal analysis of different file regions; doing so would undoubtedly take a serious toll on GitHub's servers, which are already taxed by Linguist as it is.

Darel13712 commented 2 years ago

Linguist either considers a file's contents for analysis, or it doesn't.

Honestly, this doesn't make things clearer, nor does it explain why JSON parsing, which can be done with a single file read, would take a serious toll...

Alhadis commented 2 years ago

Realistically, the impact to performance wouldn't matter if Linguist were tasked with analysing one or two repositories here and there. But Linguist is responsible for scanning millions of repositories, at every hour of the day, every time somebody pushes a change to their project. Scalability matters big-time here.

Darel13712 commented 2 years ago

But you said that Linguist can consider file contents for analysis, so it's already a thing?

Alhadis commented 2 years ago

Yes, but it only weighs a file's contribution to language usage in terms of bytes. It doesn't stop to process what those bytes might contain (whitespace, comments, generated boilerplate) in order to differentiate them from the more "important" segments.
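
In rough terms, the byte-weighted tally amounts to something like this (a simplified sketch, not Linguist's actual code):

from collections import defaultdict
import os

def language_stats(detected):
    # detected: iterable of (path, language) pairs. Each file adds
    # its full size in bytes to its language's total; nothing looks
    # inside at whitespace, outputs, or boilerplate.
    totals = defaultdict(int)
    for path, language in detected:
        totals[language] += os.path.getsize(path)
    return totals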