laurent22 / joplin

Joplin - the privacy-focused note taking app with sync capabilities for Windows, macOS, Linux, Android and iOS.
https://joplinapp.org

Spec: Export to Markdown with yaml front matter #5224

Closed CalebJohn closed 2 years ago

CalebJohn commented 3 years ago

The basic proposal is to support something like the pandoc yaml metadata block extension with the intention of providing a new export format with the utility of markdown export, but one that is less lossy. The JEX and RAW formats are already lossless, and don't need to be duplicated.

Metadata

The following properties are suggested to be included in the Markdown+YAML export

Additionally, the folder structure will be preserved, with folder names corresponding to (escaped) notebook names. Individual note names will be the escaped title. In the case of a naming clash, some random characters will be added to the name (same as the md export). The file extension will be used to denote the markup type of a note (e.g. .md or .html). Internal links will be relative and point to the markdown file of a note (by name). Resources will be stored in a resources folder at the top level; when possible a proper name will be used, with the ID as a fallback.

Notably, all IDs will be lost. Conflict status will also be lost. These are acceptable losses, as the purpose of this format is not to provide a lossless export, but rather to have something that stores the information that might be relevant to a user.
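The naming rule described above (escaped title, random characters on a clash) could look roughly like the following. This is only a sketch: the set of escaped characters, the "Untitled" fallback, the case-insensitive comparison and the names `escape_title` and `unique_filename` are assumptions for illustration, not what the existing md exporter does.

```python
import re
import secrets


def escape_title(title: str) -> str:
    """Replace characters that are commonly unsafe in file names (assumed rule set)."""
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "_", title).strip()
    return cleaned or "Untitled"


def unique_filename(title: str, extension: str, taken: set) -> str:
    """Return an unused file name, appending random characters on a clash."""
    base = escape_title(title)
    candidate = f"{base}{extension}"
    # Compare case-insensitively so "Frogs.md" and "frogs.md" count as a clash.
    while candidate.lower() in taken:
        candidate = f"{base}-{secrets.token_hex(2)}{extension}"
    taken.add(candidate.lower())
    return candidate
```

Calling `unique_filename("Frogs", ".md", taken)` twice with the same `taken` set returns `Frogs.md` the first time and a name with a random suffix the second time.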

Examples

---
Title: Frogs
Source: https://en.wikipedia.org/wiki/Frog
Created: 2021-05-01 16:40
Updated: 2021-05-01 16:40
Tags:
  - Reference
  - Cool
---

This article is about the group of amphibians. For other uses, see [Frog (disambiguation)](https://en.wikipedia.org/wiki/Frog_%28disambiguation%29 "Frog (disambiguation)").
...
---
Title: Take Home Quiz
Created: 2021-05-01 16:40
Updated: 2021-06-17 23:59
Tags:
  - School
  - Math
  - Homework
Completed?: No
Due: 2021-06-18 08:00
---

**Prove or give a counter-example of the following statement:**

> In three space dimensions and time, given an initial velocity field, there exists a vector velocity and a scalar pressure field, which are both smooth and globally defined, that solve the Navier–Stokes equations.

*Completed? accepts Yes/No or True/False
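As a rough illustration of the Yes/No or True/False rule, an importer might normalise the value like this. The function name `parse_completed` is hypothetical, and the case-insensitive matching is an assumption. Some YAML parsers already turn Yes/No into booleans, so the sketch accepts both strings and booleans.

```python
def parse_completed(value) -> bool:
    """Interpret a 'Completed?' value written as Yes/No or True/False."""
    if isinstance(value, bool):
        # The YAML parser may already have converted the value.
        return value
    text = str(value).strip().lower()
    if text in ("yes", "true"):
        return True
    if text in ("no", "false"):
        return False
    raise ValueError(f"unrecognised Completed? value: {value!r}")
```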

Examples of other apps

https://docs.zettlr.com/en/core/yaml-frontmatter/
https://www.11ty.dev/docs/data-frontmatter/
https://gohugo.io/content-management/front-matter/
https://jekyllrb.com/docs/front-matter/

edit (2021-09-27): Edited to demonstrate that the "Completed?" field accepts Yes/No or True/False.

CalebJohn commented 3 years ago

If you have any feedback or suggestions feel free to comment and start a discussion. The above spec will be modified based on those discussions.

roman-r-m commented 3 years ago

I suppose internal links should be converted to use file names. Or full paths even.

Attachments should be exported too, with similar naming rules. There might be some that do not have a title, e.g. a photo taken directly from the app, in which case using the id is probably ok.
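To make the link conversion concrete, here is a minimal sketch. It assumes internal links have the form `](:/<32 hex characters>)` and that the exporter already has a mapping from item id to exported path; `rewrite_internal_links` and `paths_by_id` are hypothetical names, not part of Joplin.

```python
import posixpath
import re
from urllib.parse import quote

# Assumption: internal links look like "](:/<32 hex characters>)".
INTERNAL_LINK = re.compile(r"\]\(:/([0-9a-f]{32})\)")


def rewrite_internal_links(body: str, note_path: str, paths_by_id: dict) -> str:
    """Point internal links at exported files, relative to the current note."""
    start = posixpath.dirname(note_path) or "."

    def replace(match):
        target = paths_by_id.get(match.group(1))
        if target is None:
            # Linked item is not part of this export: leave the link untouched.
            return match.group(0)
        relative = posixpath.relpath(target, start)
        # Percent-encode spaces and similar so the link stays valid Markdown.
        return f"]({quote(relative)})"

    return INTERNAL_LINK.sub(replace, body)
```

Leaving unknown ids untouched is one possible answer to the "export A only, or A and B?" question; a configurable exporter could instead add the missing note to the export.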

roman-r-m commented 3 years ago

If the user exports note A which links to note B, should only A be exported or both? Maybe it should be configurable.

laurent22 commented 3 years ago

I believe the current Markdown exporter preserves links to notes by linking to files instead. In fact, would it make sense to create this new exporter as an extension of the current one? On the front end there would be two Markdown exporters, but internally it would be just one, with an option to toggle the YML header on or off.

The filetype will be used to denote the markup type of a note (e.g. .md or .html).

I wonder if we should keep all the files as Markdown regardless of markup type (since Markdown can embed HTML). Otherwise it might make things more complicated, in particular because the generated HTML won't be valid with the YML header.

tessus commented 3 years ago

Caleb, which format are you using to output long and lat?

Btw, do we have a concept of timezones in the internal storage of timedate? If you don't know, I can look it up. But if we do, we should add a timezone (e.g. as an offset). It always makes me a bit nervous, if I see a datetime without TZ. In this case it might not be that important, but I ran into all sorts of troubles for using the wrong timestamp because it was not clear what it actually was.

CalebJohn commented 3 years ago

which format are you using to output long and lat?

I don't know anything about lat/long formats, so I was just going to copy what is used in RAW exports. Example

latitude: 51.06050000
longitude: -114.11020000
altitude: 0.0000

do we have a concept of timezones in the internal storage of timedate?

I don't know, sorry.

But if we do, we should add a timezone (e.g. as an offset). It always makes me a bit nervous, if I see a datetime without TZ.

Even if we don't store a timezone, it probably wouldn't hurt to tack on the system timezone.

laurent22 commented 3 years ago

For the time format, as we don't store a timezone it would have to be UTC I think. And it would be better to use a standard format like ISO 8601.

tessus commented 3 years ago

For the time format, as we don't store a timezone it would have to be UTC I think.

My notes show my local time for updated and created. If we export that time without TZ as UTC, the time (and sometimes date) will be wrong.

Or do you mean the datetime in the metadata is UTC. I will check that. I have to run a few errands now, but I should get to it this evening.

roman-r-m commented 3 years ago

Or do you mean the datetime in the metadata is UTC

Looks like it is:

https://github.com/laurent22/joplin/blob/8b08f0d2b39e76694810991f5e2e32f4ad0287dc/packages/lib/models/BaseItem.ts#L299

laurent22 commented 3 years ago

Sure it's UTC in the metadata. Not sure how anything would work, especially sync, if it wasn't.

When it's displayed in the UI it's converted to local time, and when it's exported as RAW I believe it's ISO 8601.
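A small sketch of that distinction, assuming the stored value is a Unix timestamp in milliseconds (an assumption here, not something confirmed in this thread). The function names are made up for illustration, and the display offset is passed in explicitly so the example does not depend on the machine's time zone.

```python
from datetime import datetime, timedelta, timezone


def ms_to_utc_iso(ms: int) -> str:
    """Format a millisecond Unix timestamp as ISO 8601 in UTC."""
    moment = datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


def ms_to_display(ms: int, utc_offset_hours: float) -> str:
    """Format the same instant as local wall-clock time for a given UTC offset."""
    zone = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(ms / 1000, tz=zone).strftime("%Y-%m-%d %H:%M")
```

For example, `ms_to_utc_iso(1172754000000)` gives `2007-03-01T13:00:00Z`, while `ms_to_display(1172754000000, -8)` gives `2007-03-01 05:00` for the same instant.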

tessus commented 3 years ago

Sure it's UTC in the metadata. Not sure how anything would work, especially sync, if it wasn't.

Yep, this is true. I had a slight brain freeze there.

richardsprague commented 3 years ago

One suggestion: for future compatibility, it's nice if you know the version of the exporter, in case someday you want to change the YAML. This can be as simple as a YAML tag like export-date that can be associated later with a version. If it's unwieldy to put that on each note, maybe each export can write details about itself to a file called _export.log or something.

Also, for internationalization do you need to specify UTF-8 or something to identify different character sets? (Might just be my ignorance here, so ignore this if it's irrelevant)

tessus commented 3 years ago

export-date is not that great for that. It doesn't hurt to have the export date in there, but it might not help with the version.

e.g. let's say you export on date X, but you haven't installed the latest Joplin version...

Not sure, if versioning is necessary, since it's not a real schema, but rather free text. However, if Caleb wants to use versioning, I'd rather use something like: Frontmatter version: 1

CalebJohn commented 3 years ago

@richardsprague that's a good suggestion and something we'll need to be mindful of. However, the primary purpose of this exporter is to provide a format that is friendly to non-technical users. The second goal is to provide an easy format for other note taking apps to ingest. The final goal is data stability, and it's not really a goal. If you're looking for a stable backup format, the JEX or RAW formats are what should be used. With this in mind I'm against adding an additional field that, to most users, will be techno-gibberish. Of course, I am open to discussion. The _export.log is a good compromise but I'm not sure it's necessary; a different (stable) export format would probably be better.

As for UTF-8, I'm not sure what the default behaviour is in Joplin. I believe it reads UTF-8. If other exporter/importers work for you, then you can expect this to work as well.

laurent22 commented 3 years ago

I agree that the goal of this format is to provide a user friendly export format, so adding computer properties is out of scope.

For the version number, it's not an issue - we probably will never need it but if we do we can assume that if the field is not present, it's version 1.

Encoding indeed should always be UTF-8, there's no need to support other encodings.

adamshand commented 3 years ago

It would be very helpful if the exporter also set the created and updated time stamp in the file system metadata (i.e. ctime and mtime on Linux) as well. Some programs already use that when importing.

laurent22 commented 3 years ago

It would be very helpful if the exporter also set the created and updated time stamp in the file system metadata (i.e. ctime and mtime on Linux) as well.

That might be personal preference but I don't like when applications do that. Suddenly I have some files from 2015 even though that's not when they were created. I wonder if it can cause issues with certain backup programs too.

adamshand commented 3 years ago

Totally get that it's a preference, but in my mind the whole point is that the data was created in 2015. The point of the export is to represent the data in Joplin in the most useful way possible; it doesn't matter that the actual file was created 2 minutes ago, because that's not useful information in this context.

It shouldn't cause any problems with backups. They are all new files so will be backed up because they don't exist on the destination.
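For reference, setting the modification time of an exported file is a one-liner in most languages. The sketch below uses Python's `os.utime` and assumes the note's updated time is available as a millisecond Unix timestamp; `apply_note_times` is a hypothetical helper. Note that on Linux this call can only set the access and modification times: ctime is the inode change time, not a creation time, and cannot be set this way.

```python
import os


def apply_note_times(path: str, updated_ms: int) -> None:
    """Set a file's access and modification times to the note's updated time."""
    updated_s = updated_ms / 1000
    # os.utime takes (access time, modification time) in seconds.
    os.utime(path, (updated_s, updated_s))
```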

CalebJohn commented 3 years ago

I've started working on this, so I thought I should give a brief update. Basically, I've been able to inherit a lot of behaviour from the current markdown export. I have updated the spec above to reflect that. The primary difference is that the spec now specifies that file naming conflicts will be resolved with a random character pattern (as opposed to using the id). Currently the resource naming behaviour is also inherited, but I'm going to attempt to improve on that. @laurent22 Is it okay if I update the resource naming behaviour for both the markdown export along with this PR? I think having friendly filenames (where possible) will only be beneficial for the markdown export and shouldn't harm anything.

laurent22 commented 3 years ago

@laurent22 Is it okay if I update the resource naming behaviour for both the markdown export along with this PR? I think having friendly filenames (where possible) will only be beneficial for the markdown export and shouldn't harm anything.

Yes I think that would be a good idea. I guess it could use the same logic as for the notes to avoid naming conflicts.

elsiehupp commented 3 years ago

I'm working on a project (privately for now) that I would like to be interoperable with Joplin, so standardizing Joplin's metadata headers would be enormously helpful.

It doesn't look like anyone has mentioned this, but a good metadata schema to use for Joplin notes could be the W3C Schema.org's NoteDigitalDocument, and for to-do's some subtype of Action. Schema.org has the advantage of being somewhat of a W3C standard, so it's a reliable "target" for developers. Schema.org also provides direct downloads as JSON-LD, which may be an aid to implementation.

Regarding JEX as a package format, I don't know if there's a better place to suggest this, but it would be nice if the package were to follow the macOS Document Package guidelines (including registering the format in Joplin.app's Info.plist) to a sufficient extent that macOS (and probably also iOS) would recognize JEX archives as being Joplin "documents". EDIT: I just made a separate issue on this topic.

Beyond this, it would be nice if JEX or something similar was more broadly a cross-application standard package format for "markdown bundles" (for lack of a better name), which could include one or more markdown files and any attachments within a single archive. This format could then be treated similarly to docx, odf, epub, pdf and the like by the operating system, e.g. it would be a document format rather than just a tar archive.

In addition to the aforementioned macOS Document Package compatibility, another related format that JEX could borrow from or interoperate with is rust-lang's mdbook. The main addition in mdbook is a table of contents so that the archive can be flattened into a single document for publication. While JEX isn't specifically intended for this purpose, exports could default to creating a table of contents that merely "bakes" the display order at that given time as a static hierarchy. This table of contents could be ignored by JEX imports, especially if it duplicates information in the YAML front matter, but including it could aid interoperability. Similarly, JEX could facilitate being parsed as CommonMark for Sphinx, but that might be somewhat more involved to implement...

elsiehupp commented 3 years ago

I just saw this disaster zone of a thread, and I should emphasize that I honestly don't care about how Joplin stores its notes internally; rather, my suggestions above are specific to JEX export and WebDAV sync (though the Schema.org schemas could be implemented more broadly). Basically I'd like to be able to have an application that is not Joplin be able to reliably sync/talk with an application that is Joplin, and standardizing the metadata headers in either case is an important part of doing so.

elsiehupp commented 2 years ago

@hyfree: yaml is a good format, though I prefer json. The json is easier to transfer and handle.

I agree that YAML can suffer from what I might call the “AppleScript problem” in that making it more human-readable can make it harder to properly format more complex data structures. At the same time the advantage of human-readability can outweigh the problems for simple, flat metadata. Like basically if the YAML headers are at the top they’re at least more readable than, say, email headers.

Again, it’s more important IMHO to make the headers work with other applications than it is to make the headers perfectly “just so”. For example, R Markdown seems to use YAML headers, so Joplin could try and work with that precedent (among others). There’s a tutorial for R Markdown YAML headers on GitHub here.

hyfree commented 2 years ago

@hyfree: yaml is a good format, though I prefer json. The json is easier to transfer and handle.

I agree that YAML can suffer from what I might call the “AppleScript problem” in that making it more human-readable can make it harder to properly format more complex data structures. At the same time the advantage of human-readability can outweigh the problems for simple, flat metadata. Like basically if the YAML headers are at the top they’re at least more readable than, say, email headers.

Again, it’s more important IMHO to make the headers work with other applications than it is to make the headers perfectly “just so”. For example, R Markdown seems to use YAML headers, so Joplin could try and work with that precedent (among others). There’s a tutorial for R Markdown YAML headers on GitHub here.

Thanks, I agree with you. YAML is a good format, and I am looking forward to the development of Joplin.

CalebJohn commented 2 years ago

I'm working on a project (privately for now) that I would like to be interoperable with Joplin, so standardizing Joplin's metadata headers would be enormously helpful.

Thanks for the comments @elsiehupp, my apologies for not replying earlier. I read through this when you first commented, but I was in full on implementation mode and forgot to leave a reply.

It doesn't look like anyone has mentioned this, but a good metadata schema to use for Joplin notes could be the W3C Schema.org's NoteDigitalDocument, and for to-do's some subtype of Action. Schema.org has the advantage of being somewhat of a W3C standard, so it's a reliable "target" for developers. Schema.org also provides direct downloads as JSON-LD, which may be an aid to implementation.

It's certainly interesting! I think it would be great to have an exporter/importer that deals with that schema, but for this one the goal is really to have something as simple/approachable as possible.

I looked into the multimarkdown and r-markdown formats and they look similar enough that we can support them (in a limited way) for import!

huyz commented 2 years ago

@CalebJohn Thanks, looking forward to this. Even a dev version whenever you have one.

CalebJohn commented 2 years ago

@huyz The dev version is available here. The exporter/importer is essentially done and just awaiting final code review.

huyz commented 2 years ago

The dev version is available here. The exporter/importer is essentially done and just awaiting final code review.

Thanks for that. I'm running it. Could we have the timezone in the dates?

laurent22 commented 2 years ago

That's a good point. We don't store the time zone so we can't display that, but I think the date shouldn't be in "local time" when saving the MD file because we can't know what exact time this is. @CalebJohn shouldn't the time be saved as UTC instead? Probably ISO 8601 2007-03-01T13:00:00Z.

huyz commented 2 years ago

The dates are stored in timezone-naive format in the DB? So I guess we have to rely on the assumption that all the timestamps are in local time, right? In that case, couldn't the timezone be figured out from the environment and just appended to the text, e.g. -08:00?

I figure you need to make that same assumption to be able to convert a timezone-naive datetime to UTC anyway.

Btw, I prefer ISO 8601 timestamp with a space instead of a T--both are valid but one is easier to read :)

laurent22 commented 2 years ago

Yes, if it's valid ISO without the T we can do that. But it needs the Z to specify UTC.

huyz commented 2 years ago

@CalebJohn A minor request but could the keys in the frontmatter be lowercased? I know there's no standard, but it seems to be more common to have lowercase, which makes plugins in other note tools work better out of the box when updated and created match up.

huyz commented 2 years ago

@CalebJohn Also, if we have:

latitude: 0.00000000
longitude: 0.00000000
altitude: 0.0000

I would rather the info be completely omitted since it's obviously wrong and takes up space.
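A sketch of that rule, using the same number of decimal places as the RAW example earlier in the thread. Treating latitude 0 and longitude 0 as "no location set" is an assumption (it is technically a real point on the map), and `location_fields` is a hypothetical helper name.

```python
def location_fields(latitude: float, longitude: float, altitude: float) -> dict:
    """Return front matter location entries, or nothing when no location is set."""
    # Assumption: a latitude and longitude of exactly 0 means "unset".
    if latitude == 0 and longitude == 0:
        return {}
    return {
        "latitude": f"{latitude:.8f}",
        "longitude": f"{longitude:.8f}",
        "altitude": f"{altitude:.4f}",
    }
```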

CalebJohn commented 2 years ago

That's a good point. We don't store the time zone so we can't display that, but I think the date shouldn't be in "local time" when saving the MD file because we can't know what exact time this is. @CalebJohn shouldn't the time be saved as UTC instead? Probably ISO 8601 2007-03-01T13:00:00Z.

I thought about this while implementing it and ultimately decided to output local time without a time zone. My thinking was that these are user times, not system times. Suppose I change the user_created_time to 11pm on December 31st, export a note, and then fly to London before importing. I don't know if they'll expect to import a note with a user_created_time of 7am January 1st the following year. Although I can see the other side as well: once imported back into Joplin it makes sense for the dates to be identical. The current implementation uses the user's preferred date/time format, so changing this will be a loss for readability, but a win for stability. I'll make the change and if we decide against it, it will be easy to revert.

With that said, the importer handles a variety of date formats including formats that include a timezone. This means that users are free to specify a timezone and the note will import exactly as expected. I will add a test to ensure this remains true.
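To illustrate what "a variety of date formats" could mean in practice, here is a sketch of a tolerant parser. The list of formats and the name `parse_note_date` are assumptions, not the importer's actual behaviour. Dates without a time zone are interpreted in a caller-supplied zone, and everything is normalised to UTC.

```python
from datetime import datetime, timezone

# Formats tried in order; those ending in %z accept "Z" or an offset like -08:00.
FORMATS = (
    "%Y-%m-%dT%H:%M:%S%z",
    "%Y-%m-%d %H:%M:%S%z",
    "%Y-%m-%d %H:%M%z",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d %H:%M",
)


def parse_note_date(text: str, local_zone: timezone) -> datetime:
    """Parse a front matter date and return it as an aware datetime in UTC."""
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
        if parsed.tzinfo is None:
            # No zone in the text: treat it as wall-clock time in local_zone.
            parsed = parsed.replace(tzinfo=local_zone)
        return parsed.astimezone(timezone.utc)
    raise ValueError(f"unrecognised date: {text!r}")
```

With this approach a user who appends `-08:00` to a date gets exactly that instant on import, while an untouched `2021-05-01 16:40` is read as local time.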

Btw, I prefer ISO 8601 timestamp with a space instead of a T--both are valid but one is easier to read :)

Yes, if it's valid ISO without the T we can do that. But it needs the Z to specify UTC.

The space isn't valid ISO 8601, but it is valid under RFC 3339, which is probably what matters more.

A minor request but could the keys in the frontmatter be lowercased? I know there's no standard, but it seems to be more common to have lowercase, which makes plugins in other note tools work better out of the box when updated and created match up.

The purpose of this exporter is skewed more towards user friendliness and less towards interop with other services. From what I've seen, most other services don't even use updated/created. Title is very common though, so it could be potentially worthwhile to change this. Specifically I tested with pandoc, and it doesn't like the capitalization. I'll make the update and if @laurent22 disagrees it will be easy to revert.

I would rather the info be completely omitted since it's obviously wrong and takes up space.

Thanks, I had originally intended to do it that way (see the spec above), but I forgot at some point and it got missed. It's corrected now.

huyz commented 2 years ago

I thought about this while implementing it and ultimately decided to output local time without time zone.

You know what, actually, I think your solution might make the most sense. In my case, it turned out to be helpful that there was no timezone, because that told me to look closer and to learn that Joplin uses timezone-naive "user times". This spurred me to write a post-processing Perl script to add the timezone based on the dates and what I remember about my travels, which would be a lot more accurate than just putting a misleading local timezone. Perhaps a corresponding explanation right before the export process would be appropriate, or even giving the user the option to leave the timezone off or to assume the current local timezone from the environment.

The space isn't valid ISO 8601, but it is valid under RFC 3339, which is probably what matters more.

Oh yeah, not since 8601-1:2019. TIL