jbenet / depviz

dependency visualizer for the web
https://jbenet.github.io/depviz
MIT License

Find/render related nodes #22

Open wking opened 7 years ago

wking commented 7 years ago

We don't do this at the moment (there's a FIXME in the GitHub module). Screenshot for the designed display in #9. Related but (I think?) distinct idea in #15.

rht commented 7 years ago

Hmmm... Looks like this hasn't been specced and exists only in the viz example. There are two ways to approach this:

  1. do a regex match on the remaining issues that haven't been matched by the 'depends on: ...' syntax (e.g. '#9' and '#15' in https://github.com/jbenet/depviz/issues/22#issue-192108780)
  2. vectorize (semantically or not) the content of each issue, then construct a similarity matrix. This could be used for issue dedup as well. An existing example I have seen is the related-questions list on SO shown when posting a new question (it looks like it only matches question titles, not bodies).

(1) can be added directly, but (2) requires either lunr.js, so that the code can still run on gh-pages, or full-blown server-side indexing/vectorization.
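
Option (1) could be sketched roughly as follows. This is a minimal Python illustration, not depviz's actual code: the function name `find_loose_references` and the exact shape of a 'depends on:' line are assumptions made for the example.

```python
import re

# Assumed shape of an explicit dependency line, e.g. "depends on: #9, #15".
DEPENDS_LINE = re.compile(r"^[ \t]*depends on:.*$", re.IGNORECASE | re.MULTILINE)
# A bare issue reference such as "#9". The lookbehind skips things like
# "foo#9" and URL fragments ("issues/22#issue-...").
ISSUE_REF = re.compile(r"(?<![\w/#])#(\d+)\b")


def find_loose_references(body: str) -> list[int]:
    """Return issue numbers mentioned outside any 'depends on:' line."""
    remaining = DEPENDS_LINE.sub("", body)
    seen: list[int] = []
    for match in ISSUE_REF.finditer(remaining):
        number = int(match.group(1))
        if number not in seen:  # keep first-seen order, drop repeats
            seen.append(number)
    return seen
```

Whether every loose mention should become a 'related' edge is a separate question; the sketch only collects candidates.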

wking commented 7 years ago

On Mon, Jan 09, 2017 at 05:46:17AM -0800, rht wrote:

> 1. do a regex match on the remaining issues that haven't been matched by the 'depends on: ...' syntax (e.g. '#9' and '#15' in https://github.com/jbenet/depviz/issues/22#issue-192108780)

I think we want a regexp looking for ‘related to: …’ syntax, because that will let you declare relations consistently regardless of whether the relative is on GitHub or not.
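
A parser for such an explicit syntax might look like the sketch below. The thread does not define the syntax beyond ‘related to: …’, so the comma-separated target list and the name `parse_related` are assumptions.

```python
import re

# Hypothetical syntax: a line starting with "related to:" followed by a
# comma-separated list of targets (issue numbers or arbitrary URLs).
RELATED_LINE = re.compile(r"^[ \t]*related to:[ \t]*(.+)$", re.IGNORECASE | re.MULTILINE)


def parse_related(body: str) -> list[str]:
    """Collect every target named on a 'related to:' line, in order."""
    targets: list[str] = []
    for line in RELATED_LINE.finditer(body):
        for target in line.group(1).split(","):
            target = target.strip()
            if target and target not in targets:
                targets.append(target)
    return targets
```

Because the targets are kept as opaque strings, the same declaration works whether the relative lives on GitHub or elsewhere.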

> 2. vectorize (semantically or not) the content of each issue, then construct a similarity matrix. This could be used for issue dedup as well. An existing example I have seen is the related-questions list on SO shown when posting a new question (it looks like it only matches question titles, not bodies).

I'd rather have these be explicitly declared (with ‘related to: …’), since that avoids the need to define matching heuristics. And I'm not sure how often related issues share a lot of similar strings. “Related” is different from “duplicated”.

The reason I've put off related edges so far is that they're undirected, so you'd either have to document them on each side (in issue A: ‘related to: #B’, and in issue B ‘related to: #A’) or have a way to discover backreferences. See #25 about the difficulties of backreference discovery.
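
One way to picture the undirected-edge problem, as a sketch only: if every participating issue body has already been loaded, one-sided declarations can be merged so that an edge declared in either issue is visible from both. The function names and the `declared` mapping are hypothetical, and the sketch does not address discovering backreferences from issues that were never fetched.

```python
def undirected_edges(declared: dict[str, set[str]]) -> set[frozenset[str]]:
    """Merge one-sided 'related to' declarations into undirected edges.

    `declared` maps an issue key to the keys it names as related. Each edge
    is stored as a frozenset, so A->B and B->A collapse into a single edge.
    """
    edges: set[frozenset[str]] = set()
    for source, targets in declared.items():
        for target in targets:
            if target != source:  # ignore self-references
                edges.add(frozenset((source, target)))
    return edges


def neighbours(edges: set[frozenset[str]], key: str) -> set[str]:
    """Return every issue connected to `key`, regardless of who declared it."""
    return {other for edge in edges if key in edge for other in edge if other != key}
```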

rht commented 7 years ago

> I think we want a regexp looking for ‘related to: …’ syntax, because that will let you declare relations consistently regardless of whether the relative is on GitHub or not.

For that purpose, an explicit syntax shouldn't be required. The regexp can be extended to parse GitLab, mailing-list thread, and Atlassian URLs.
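
A rough idea of what such an extended matcher could look like is sketched below. The URL shapes are assumptions based on the hosted versions of each tracker; self-hosted instances may differ, and mailing-list archives are left out here.

```python
import re

# Assumed URL shapes; treat these patterns as placeholders, not a spec.
TRACKER_PATTERNS = {
    "github": re.compile(r"https?://github\.com/([\w.-]+)/([\w.-]+)/(?:issues|pull)/(\d+)"),
    "gitlab": re.compile(r"https?://gitlab\.com/([\w.-]+)/([\w.-]+)/(?:-/)?issues/(\d+)"),
    "jira": re.compile(r"https?://([\w-]+)\.atlassian\.net/browse/([A-Z][A-Z0-9]*-\d+)"),
}


def find_tracker_links(body: str) -> list[tuple[str, str]]:
    """Return (tracker, key) pairs for every recognised issue URL."""
    found: list[tuple[str, str]] = []
    for tracker, pattern in TRACKER_PATTERNS.items():
        for match in pattern.finditer(body):
            key = "/".join(match.groups())  # e.g. "owner/repo/25"
            if (tracker, key) not in found:
                found.append((tracker, key))
    return found
```

Each new tracker adds another pattern to maintain, which is the added complexity this approach trades for not requiring an explicit syntax.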

> I'd rather have these be explicitly declared (with ‘related to: …’), since that avoids the need to define matching heuristics.

Matching heuristics are useful for discovery, since a human annotator can't constantly comb through the issues (or recall every possibly related past issue) to find them.

> And I'm not sure how often related issues share a lot of similar strings.

They should both refer to specific objects, variables, error messages, etc. The descriptions should be regular enough for this to work; there are already libraries that detect duplicated code.
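
To make the similarity-matrix idea concrete, here is a minimal sketch using raw word counts and cosine similarity. It is only one possible vectorization (no TF-IDF weighting, no semantic embedding), and all names are made up for the example. Common words will inflate the scores, so a real tool would likely weight or filter terms.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lower-case bag of words; identifiers such as `foo_bar` stay whole."""
    return Counter(re.findall(r"[a-z_][a-z0-9_]*", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def similarity_matrix(bodies: list[str]) -> list[list[float]]:
    """Pairwise cosine similarity of issue bodies; the result is symmetric."""
    vectors = [tokenize(body) for body in bodies]
    return [[cosine(u, v) for v in vectors] for u in vectors]
```

Two issues that mention the same function name and error type score well above an unrelated one, which is the regularity being relied on here.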

> The reason I've put off related edges so far is that they're undirected, so you'd either have to document them on each side (in issue A: ‘related to: #B’, and in issue B ‘related to: #A’) or have a way to discover backreferences.

This could be done incrementally. Backreferences within GitHub have been worked out in https://github.com/jbenet/depviz/issues/25#issuecomment-271389074. Cross-linking GitHub and GitLab issues would be a separate undertaking, and more doable starting from the GitLab end (https://gitlab.com/search?utf8=%E2%9C%93&search=create_cross_references&group_id=&project_id=13083&search_code=true&repository_ref=bb02141e417ff21deb7707a806a313545bbdd5af).

wking commented 7 years ago

On Tue, Jan 10, 2017 at 02:19:29AM -0800, rht wrote:

>> I think we want a regexp looking for ‘related to: …’ syntax, because that will let you declare relations consistently regardless of whether the relative is on GitHub or not.

> For that purpose, an explicit syntax shouldn't be required. The regexp can be extended to parse GitLab, mailing-list thread, and Atlassian URLs.

Fair enough. That makes the regexp more complicated, but it would be workable. However…

>> I'd rather have these be explicitly declared (with ‘related to: …’), since that avoids the need to define matching heuristics.

> Matching heuristics are useful for discovery, since a human annotator can't constantly comb through the issues (or recall every possibly related past issue) to find them.

Maybe a separate tool to apply this heuristic and suggest ‘related to: …’ annotations for the annotator to consider? For example, 1 links to #45, but the connection between #45 and #59 is mostly for historical interest and not something where I think an edge in the issue graph would help organize future work. Using ‘related to: …’ lets you curate your edges, and folks who are comfortable with an automated heuristic can run:

$ your-heuristic-related-to-injector jbenet/depviz

(or whatever) to add them to their project.
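
A suggestion-only tool along these lines might look like the sketch below. The heuristic (Jaccard overlap of word sets), the 0.5 threshold, and all names are placeholders chosen for illustration; the command above is itself hypothetical. The sketch prints proposals rather than editing issues, so the annotator still curates which edges are kept.

```python
import re
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Share of distinct words two texts have in common (stand-in heuristic)."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def suggest_related(issues: dict[int, str], threshold: float = 0.5) -> list[str]:
    """Propose 'related to:' annotations for a human to accept or reject."""
    suggestions: list[str] = []
    for (num_a, body_a), (num_b, body_b) in combinations(sorted(issues.items()), 2):
        if jaccard(body_a, body_b) >= threshold:
            # Undirected relation: propose the annotation on both sides.
            suggestions.append(f"#{num_a}: related to: #{num_b}")
            suggestions.append(f"#{num_b}: related to: #{num_a}")
    return suggestions


if __name__ == "__main__":
    sample = {1: "crash in render graph", 2: "render graph crash on load", 3: "docs typo"}
    for line in suggest_related(sample):
        print(line)
```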