Suggestion: leverage data formats

Beanow commented 4 years ago

Noticing some parsing of the txt file is happening: https://github.com/rseng/good-first-issues/blob/266448bc9d061e1e9a6326f4f6698c98f42b5b6c/scripts/generate-first-issues.py#L39-L41

JSON/YAML could probably make this easier and self-documenting:

# json style
[
  {
    "owner": "organization",
    "name": "example-repo",
    "tags": ["hello", "world"]
  }
]

# yaml style
- owner: organization
  name: example-repo
  tags: [hello, world]

Likewise, I think it would be really useful to have an option to produce data output. Rather than directly going to front matter + markdown. Even if just internally in the python scripts.

Here's some pseudo code of what I mean in terms of interfaces.

type Filename = string;
type MarkdownContent = string;

interface ScriptFunctions {
  fetchRepos(path: Filename): Repository[];
  fetchIssues(repos: Repository[]): Map<Repository, Issue[]>;
  templateIssues(data: Map<Repository, Issue[]>): Map<Filename, MarkdownContent>;
  writeMarkdownFiles(files: Map<Filename, MarkdownContent>);
}

This would make it trivial to support writing JSON data out instead of MD files :]

vsoch commented 4 years ago

The audience I chose the input "format" for is academic and data science groups, and a lot of them have preference for formats like text / csv, even over json (and definitely yaml), so based on the fact that people will most readily copy paste a GitHub url (including organization and repository name) and then we just need a list of tags after that, a single url then comma delimited tags is a simple and logical solution. I think the formats you shared would be reasonable if there were more needed than that.

What kind of use cases do you have in mind for just data output?

Beanow commented 4 years ago

Writing yaml is already a requirement for setting up the GH action. https://github.com/rseng/good-first-issues#example-usage

Copy pasting URLs rather than pulling them apart is a fair point. I'm interpreting that as a deliberate choice to prioritize simple to update, over self-documenting.

You could have that same priority with something very close to the custom txt you're using, while still being valid json/yml.

# yaml style, closest to txt format
https://github.com/organization/example-repo: [hello, world]

# json style
{
  "https://github.com/organization/example-repo": ["hello", "world"]
}

# yaml alternative, with some empty lines would be nice to read
https://github.com/organization/example-repo:
  - hello
  - world

What kind of use cases do you have in mind for just data output?

Raw data in a well-supported and simple format like JSON, is by far the best interoperability choice you could make :smile:

Take the other issue: https://github.com/rseng/good-first-issues/issues/3 Q: Why not just scrape the HTML to get this data, instead of wait for an API? A: Because HTML (and XML for that matter) is a horrid and unwieldy data format. In fact, it's unfair to call it a data format, because it's a markup language. Same as markdown.

Having the scores.json format, allowed me to create SourceCred widgets before I had a clue about how any of the SourceCred code worked or what sort of APIs or tooling it had.

Likewise, I think giving the option to output JSON, allows people to use this action as input for whatever else they can come up with. Maybe they would like to load the json into some javascript browser tool rather than static-rendering. Maybe they'll load it into their own script and add bounties / cred scores for these issues. Maybe they want to store it in a database.

Whatever the use-case is, having "just data" will make it significantly easier to do. That said, I don't myself have a use-case right now. So I would suggest this more as a best practice than a feature request. :slightly_smiling_face:

vsoch commented 4 years ago

Likewise, I think giving the option to output JSON, allows people to use this action as input for whatever else they can come up with. Maybe they would like to load the json into some javascript browser tool rather than static-rendering. Maybe they'll load it into their own script and add bounties / cred scores for these issues. Maybe they want to store it in a database.

I am in total support! I would suggest that we wait for someone actually asking for this, before designing something that isn't asked for.

rseng / good-first-issues

Suggestion: leverage data formats #2