sonatype-nexus-community / oysteR

Create purls from the filtered sands of your dependencies, powered by OSS Index
https://sonatype-nexus-community.github.io/oysteR/
Apache License 2.0

Add other R package sources #18

Open JosiahParry opened 4 years ago

JosiahParry commented 4 years ago

The internal function get_purls assumes that all packages come from CRAN, which is not always the case.

Alternatively, this package could use {renv}—which records package source effectively—to identify package source.

https://github.com/sonatype-nexus-community/oysteR/blob/df9e27894368e46ffdd20859828ca13ae68713eb/R/audit_deps.R#L35
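One way to picture the idea of identifying a package's source is the rough sketch below. It does not use {renv}; instead it inspects DESCRIPTION fields that installation tools commonly write (`RemoteType`, `biocViews`, `Repository`). The function name `guess_pkg_source` is hypothetical and is not part of oysteR, and the heuristics are assumptions that will miss some cases.

```r
# Hypothetical helper (not part of oysteR): guess where a package came from.
# `desc` is a named list, e.g. the result of utils::packageDescription("pkg").
guess_pkg_source <- function(desc) {
  # Packages installed from GitHub by remotes/devtools usually carry RemoteType
  if (!is.null(desc$RemoteType) && identical(tolower(desc$RemoteType), "github")) {
    return("github")
  }
  # Assumption: a biocViews field indicates a Bioconductor package
  if (!is.null(desc$biocViews)) {
    return("bioconductor")
  }
  # Packages installed from a CRAN mirror normally record Repository: CRAN
  if (identical(desc$Repository, "CRAN")) {
    return("cran")
  }
  "unknown"
}

# Example with hand-written metadata instead of an installed package
guess_pkg_source(list(Package = "genius", RemoteType = "github"))
```

With an installed package, the call would look like `guess_pkg_source(utils::packageDescription("glue"))`.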

csgillespie commented 4 years ago

Thanks for your comment. I'm pinging @DarthHater to double check my response.

JosiahParry commented 4 years ago

Ah, that would make sense. Perhaps, then, it would be worth checking the package against tools::CRAN_package_db() to make sure there is an NA when a package is not on CRAN.
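A minimal sketch of that check, assuming the data frame returned by `tools::CRAN_package_db()` (which has `Package` and `Version` columns). The function name `cran_version_or_na` is hypothetical; passing `cran_db` explicitly avoids a network call.

```r
# Hypothetical helper: look up each package in the CRAN package database and
# return its CRAN version, or NA when the package is not on CRAN.
# `cran_db` defaults to a live download; supply a data frame with `Package`
# and `Version` columns to stay offline.
cran_version_or_na <- function(pkgs, cran_db = tools::CRAN_package_db()) {
  idx <- match(pkgs, cran_db$Package)  # NA where a package is absent
  stats::setNames(as.character(cran_db$Version[idx]), pkgs)
}

# Offline example using a small stand-in for the real database
fake_db <- data.frame(
  Package = c("dplyr", "glue"),
  Version = c("1.0.0", "1.4.1"),
  stringsAsFactors = FALSE
)
cran_version_or_na(c("glue", "genius"), fake_db)
```

Note that this only reports what is on CRAN today; it says nothing about packages that were on CRAN in the past.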

On Mon, Jul 27, 2020 at 09:43 Colin Gillespie notifications@github.com wrote:

Thanks for your comment. I'm pinging @DarthHater to double check my response.

  • Currently, the only packages that Sonatype checks are those on CRAN.
  • If the package was installed via the RStudio package manager (say), then the pkg_name and version would still be checked against the Sonatype CRAN database, in which case it either exists there or it doesn't.


csgillespie commented 4 years ago

@JosiahParry I didn't know about (or had forgotten?) tools::CRAN_package_db(). I suppose one issue is when a package has been removed from CRAN for security (or other) reasons. The package would still be in the Sonatype DB, but not on CRAN.

DarthHater commented 4 years ago

Right now, we only really index CRAN for OSS Index, so if a package came from somewhere else, we wouldn't know about it! What other indexes/registries/etc. exist for R? @JosiahParry, do you know of any? I'm also pinging @brittanybelle and @ken-duck, who might be interested!

DarthHater commented 4 years ago

Also love the note on the package source. If we can get some more sources into OSS Index, we would love a PR on the package source stuff!

csgillespie commented 4 years ago

The next most obvious (and very important) repo is Bioconductor (https://www.bioconductor.org/). It is used for life-science work, e.g. in pharma, governments, and universities.

JosiahParry commented 4 years ago

@DarthHater, I just learned about Sonatype today, so I'll have to understand how it works first! Off the top of my noggin, of course, is https://www.bioconductor.org/ (but I've never installed a package from there), along with git-hosted repos.

csgillespie commented 4 years ago

@JosiahParry Beat you by 20 seconds ;)

DarthHater commented 4 years ago

Git-hosted repos are probably something we can't handle super easily (at the moment, anyway), but Bioconductor probably! The reason is that we'd have to ingest more or less every git repo that exists, and good luck finding them all, hah :)

JosiahParry commented 4 years ago

@DarthHater would it be possible to have it ingest on request? Or is it sort of an all or nothing kind of thing?

DarthHater commented 4 years ago

@JosiahParry I don't know the answer to that, but something we could likely look at (maybe) is this: we get a list of coordinates when someone makes a request, and if we could identify the git sources separately, then potentially there could be some process like "Well, people are asking about this, maybe we should go get info on it". That doesn't exist today, but it's certainly something to think about!

csgillespie commented 4 years ago

@Darkvar @JosiahParry Adding in "trusted organisations" as a halfway house would be a good start and probably cover 95% of usage. For example,

But bioconductor should be first on the list

JosiahParry commented 4 years ago

That would be a good idea. I'd probably put r-lib (https://github.com/r-lib) and r-dbi (https://github.com/r-dbi) on that list—but that's based on my knowledge and exposure of course!

@DarthHater and @csgillespie, any recommendations for getting started on understanding the OSS Index API? I'll definitely have to get an understanding of "coordinates" and the index itself.

DarthHater commented 4 years ago

We use what's called a purl (package URL) to identify components across all ecosystems. There's so much variability that a few people came together and wrote a spec to identify things in an "easy" way. Every tool we work on implements this; you could take a gander at: https://github.com/package-url/purl-spec
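For illustration, a CRAN purl has the form `pkg:cran/<name>@<version>`. The sketch below builds such strings in base R; the function name `make_cran_purl` is hypothetical and is not claimed to match oysteR's internal get_purls.

```r
# Hypothetical helper: build purls of the form pkg:cran/<name>@<version>.
# Vectorised over `name` and `version`, which must have the same length.
make_cran_purl <- function(name, version) {
  stopifnot(length(name) == length(version))
  paste0("pkg:cran/", name, "@", version)
}

# Purls for the packages installed in the current library
ip <- utils::installed.packages()
head(make_cran_purl(ip[, "Package"], ip[, "Version"]))
```

This treats every installed package as a CRAN package, which is exactly the assumption this issue questions.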

DarthHater commented 4 years ago

The API we use for querying is a POST that takes a list of purls in a coordinates array, IIRC. You can post 128 at a time, and there is some rate limiting, which is in place mostly to keep people from scraping it.
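A rough sketch of how a client might respect the 128-purl limit: split the purls into batches and POST each one. The names `batch_purls` and `post_batch` are hypothetical, the endpoint URL and the use of {httr} are assumptions rather than a description of what oysteR does, and rate limiting is not handled here.

```r
# Hypothetical helper: split a character vector of purls into batches of at
# most `batch_size` elements (128 is the limit mentioned above).
batch_purls <- function(purls, batch_size = 128L) {
  if (length(purls) == 0L) {
    return(list())
  }
  unname(split(purls, ceiling(seq_along(purls) / batch_size)))
}

# Hypothetical helper: POST one batch as {"coordinates": [...]}.
# Assumptions: the endpoint URL below and the availability of {httr}.
post_batch <- function(batch,
                       url = "https://ossindex.sonatype.org/api/v3/component-report") {
  if (!requireNamespace("httr", quietly = TRUE)) {
    stop("The httr package is required to call the API.")
  }
  # as.list() keeps a single purl encoded as a JSON array, not a bare string
  httr::POST(url, body = list(coordinates = as.list(batch)), encode = "json")
}

# 300 made-up purls are split into three batches
batches <- batch_purls(paste0("pkg:cran/pkg", 1:300, "@1.0.0"))
lengths(batches)
```

Each batch could then be sent with `lapply(batches, post_batch)`, ideally with a pause between requests.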

JosiahParry commented 4 years ago

@DarthHater Wonderful I'll give that a perusal! I appreciate it.

JosiahParry commented 4 years ago

So it's absolutely possible to use the API on a GitHub-hosted package. I'm not sure what it is actually doing with the repo, but the API call is viable.

The downside with this approach is that it requires the package to be downloaded and installed in order to fetch the GH sha.

# get github sha and make purl for josiahparry/genius
pkg <- "genius"
pkg_d <- utils::packageDescription(pkg)
# the Remote* fields are only present for packages installed from GitHub
gh_purl <- glue::glue("pkg:github/{pkg_d$RemoteUsername}/{pkg_d$RemoteRepo}@{pkg_d$RemoteSha}")

# call the API (call_oss_index is an internal oysteR function)
oysteR:::call_oss_index(list(gh_purl), TRUE)

A more effective approach would be to use a method similar to what remotes does and actually query the GH API, which will provide the sha.
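Querying the GH API for the sha could look roughly like the sketch below. It assumes GitHub's REST endpoint `GET /repos/{owner}/{repo}/commits/{ref}` and the {jsonlite} package; all three function names are hypothetical, and this is not a description of how remotes implements it. Unauthenticated requests are subject to GitHub's rate limits.

```r
# Hypothetical helper: URL of the GitHub REST endpoint for a single commit.
# Assumption: `ref` may be a branch, tag, sha, or "HEAD".
github_commit_url <- function(owner, repo, ref = "HEAD") {
  sprintf("https://api.github.com/repos/%s/%s/commits/%s", owner, repo, ref)
}

# Hypothetical helper: build a GitHub purl. The purl spec asks for the
# owner and repository name to be lowercased for the github type.
github_purl <- function(owner, repo, sha) {
  paste0("pkg:github/", tolower(owner), "/", tolower(repo), "@", sha)
}

# Hypothetical helper: fetch the commit sha without installing the package.
# Needs network access and {jsonlite}; not run here.
fetch_github_sha <- function(owner, repo, ref = "HEAD") {
  if (!requireNamespace("jsonlite", quietly = TRUE)) {
    stop("The jsonlite package is required to query the GitHub API.")
  }
  jsonlite::fromJSON(github_commit_url(owner, repo, ref))$sha
}

github_commit_url("josiahparry", "genius")
```

With network access, `github_purl("josiahparry", "genius", fetch_github_sha("josiahparry", "genius"))` would give a purl for the latest commit, rather than for the commit that is actually installed locally.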

DarthHater commented 4 years ago

The API will accept it, but what I was getting at is that we don't have any data for those yet, so you can send it, but more or less nothing will come back :)

brittanybelle commented 4 years ago

Great discussion here, thanks for bringing the extra awareness from the R community about different package sources! I'll make sure this stays on our radar. :)

csgillespie commented 4 years ago

For future readers: it would be great if Sonatype also included packages from:

JosiahParry commented 3 years ago

After more thought, I think that this issue should be focused solely on BioC. Most R packages that will see wide use will be, at some point, published to either CRAN or Bioconductor. Having GitHub as a source over CRAN, for example, would only make sense if the package will never be hosted on CRAN. If a package is not accepted to CRAN, there is typically a good reason.

I also do not completely know how the OSS Index works. For example, what are the sources? If the source aggregates from things like CVE Details (R CVE), then I'm not sure differentiating actually provides any utility.