hasgeek / funnel

Website for hasgeek.com
https://hasgeek.com/
GNU Affero General Public License v3.0

markdown-it-py plugin for supporting allow-listed embeds #1498

Open miteshashar opened 2 years ago

miteshashar commented 2 years ago

A naked URL in a paragraph by itself shall be rendered as an embed when it is in the allowlist maintained in the backend.

Depending on the source, embeds may be regular HTML or oEmbed-based.

The final embed code for oEmbed-based sources needs to be cached on the backend.

miteshashar commented 2 years ago

@jace When you specified that the allowlist needs to be maintained in the backend, did you mean for it to be maintained in the database and managed through the application, or are we going to maintain it in code configuration?

jace commented 2 years ago

Code, as each provider has to be manually validated. No new database models are being introduced here.

miteshashar commented 2 years ago

What are the providers we plan to support in the first cut?

jace commented 2 years ago

In decreasing order of priority for each category:

Videos: YouTube, Vimeo, Twitch
Docs: PDF, GitHub Gist, Google Drive, AirTable, Office 365
Social: Twitter, Instagram, LinkedIn, Facebook
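
A minimal sketch of what such a code-maintained allowlist could look like (provider names from the list above; the structure, patterns, and oEmbed endpoint are illustrative assumptions, not from the repo):

```python
import re

# Hypothetical allowlist, maintained in code since each provider is manually
# validated. Each entry carries the URL patterns it accepts and the embed
# mechanism: a plain iframe needs no API call; oEmbed requires a network
# fetch whose result must be cached.
EMBED_ALLOWLIST = {
    'youtube': {
        'patterns': [
            re.compile(r'^https://(www\.)?youtube\.com/watch\?v=[\w-]+'),
            re.compile(r'^https://youtu\.be/[\w-]+'),
        ],
        'mechanism': 'iframe',
    },
    'twitter': {
        'patterns': [re.compile(r'^https://(www\.)?twitter\.com/\w+/status/\d+')],
        'mechanism': 'oembed',
        'endpoint': 'https://publish.twitter.com/oembed',  # assumption
    },
}


def match_provider(url: str) -> str | None:
    """Return the allow-listed provider name for a URL, or None."""
    for name, provider in EMBED_ALLOWLIST.items():
        if any(pattern.match(url) for pattern in provider['patterns']):
            return name
    return None
```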

miteshashar commented 1 year ago

Based on this chat, it was determined that we are not handling this in markdown at the moment. Further, it was also discussed that AST/token generation should still happen in models, since it is a one-time process on save. The post-processing is a separate process, and how we handle it can be revisited after the restructuring of the architecture discussed below.

Mitesh Ashar, [15-Dec-2022 at 10:03:31 PM (15-Dec-2022 at 10:04:22 PM)]: For embeds:

Kiran Jonnalagadda, [15-Dec-2022 at 10:47:49 PM]: Yes, but oEmbed should be deferred to a background job. It can't block the request with new network requests. This will add a major complication to the current composite class as it has no context.

This means the markdown parser only generates a list of URLs to process in the first pass, somehow hands it off to a background job, and then the background job collects the embeds and reruns the markdown processor to find the places to insert the content. Or it does a simpler find-and-replace for `<p>url</p>`.
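
For illustration, the first pass could be a markdown-it-py core rule that collects naked URLs sitting alone in a paragraph and stashes them in env for later hand-off. This is a sketch of the approach, not the actual plugin:

```python
from markdown_it import MarkdownIt


def embed_collector(md: MarkdownIt) -> None:
    """Core rule: collect URLs that form a paragraph by themselves."""

    def scan(state) -> None:
        urls = []
        tokens = state.tokens
        for i, token in enumerate(tokens):
            # Only look at inline content directly inside a paragraph.
            if token.type != 'inline' or tokens[i - 1].type != 'paragraph_open':
                continue
            children = token.children or []
            # A bare linkified URL renders as exactly three child tokens:
            # link_open, text, link_close, with the text matching the href.
            if (
                len(children) == 3
                and children[0].type == 'link_open'
                and children[1].type == 'text'
                and children[2].type == 'link_close'
                and children[0].attrGet('href') == children[1].content
            ):
                urls.append(children[1].content)
        state.env['embed_urls'] = urls

    md.core.ruler.push('embed_collector', scan)


md = MarkdownIt('commonmark', {'linkify': True}).enable('linkify')
md.use(embed_collector)
env: dict = {}
md.render('Some text.\n\nhttps://www.youtube.com/watch?v=abc123\n', env)
print(env['embed_urls'])  # ['https://www.youtube.com/watch?v=abc123']
```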

But with two rounds of hand-off, we've hit the limit of what a composite class can do. For this we have to replace the composite constructor itself. It'll require some digging into SQLAlchemy architecture.

One potential way is for the composite to not actually process markdown. Instead, we could define a new data type for Markdown and have the composite depend on it. I'm assuming the SQLAlchemy type system will give us better access to the context, but again, this needs homework.
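
As a rough sketch of that idea, a SQLAlchemy TypeDecorator marks where the hook could live; whether it actually gets better access to context is exactly the open homework:

```python
from sqlalchemy.types import TypeDecorator, UnicodeText


class MarkdownType(TypeDecorator):
    """Hypothetical column type for Markdown source text.

    Instead of a composite constructor doing the processing, the type
    itself could hook the write path.
    """

    impl = UnicodeText
    cache_ok = True

    def process_bind_param(self, value, dialect):
        # Called when the value is bound to an INSERT/UPDATE. Rendering
        # could be triggered from here, but this layer has no request
        # context and cannot make network fetches for oEmbed.
        return value

    def process_result_value(self, value, dialect):
        # Called when the value is loaded from the database.
        return value
```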

Yet another option: defer it all to the view, which does the find-and-replace-from-cache routine at runtime.

Mitesh Ashar, [15-Dec-2022 at 11:40:44 PM (15-Dec-2022 at 11:43:16 PM)]: I remember that for Twitter, you mentioned we make the user enter:

```{twitter} url
Tweet text
```

How do we intend to process this?

So, we are only going to process URLs that appear on a line by themselves, and ignore other URLs that are inline.

Mitesh Ashar, [16-Dec-2022 at 1:06:14 AM (16-Dec-2022 at 1:18:16 AM)]: There are numerous API calls for preview during an edit. Why not call and cache at the time of previews? If cache hit, deliver the embed code to the preview. If cache miss, check if it is already queued; if not, queue it and deliver only the URL. Doing this should have embed codes for at least 95% of URLs already in cache. The further question is going to be how to process the remaining ones.
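
A sketch of that cache-or-queue routine, with in-memory stand-ins for the real cache and job queue (all names here are hypothetical):

```python
import threading

# In-memory stand-ins for the real cache and background job queue.
_cache: dict[str, str] = {}
_queued: set[str] = set()
_lock = threading.Lock()


def fetch_oembed(url: str) -> None:
    """Background job: fetch the oEmbed payload and cache the embed HTML."""
    # The actual network fetch to the provider's oEmbed endpoint goes here;
    # on failure, a failure marker would be cached instead.
    _cache[url] = f'<div class="embed" data-url="{url}">…</div>'
    with _lock:
        _queued.discard(url)


def embed_for_preview(url: str) -> str:
    """Cache hit: deliver embed code. Cache miss: queue a fetch if not
    already queued, and deliver only the URL for now."""
    cached = _cache.get(url)
    if cached is not None:
        return cached
    with _lock:
        if url not in _queued:
            _queued.add(url)
            # In the real app this would be a background job, e.g. RQ/Celery.
            threading.Thread(target=fetch_oembed, args=(url,), daemon=True).start()
    return url
```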

Challenge: all of this happens inside the mdit plugin. markdown() is agnostic to it; it just delivers the output produced by the rule set of the current profile's mdit instance. So it doesn't know about the inner workings of the plugin.

A potential solution: we don't let markdown() stay agnostic. We pass a preview flag to it, which is passed on to the plugin in an appropriate way. If a call is not a preview call and the content still has links without a cached embed, it schedules a one-time process to reparse and save. The plugin will continue to schedule this until all links have either been processed or are known to fail, i.e. they have either an embed code or a failure marker in the cache.
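
A sketch of how the preview flag could thread through via markdown-it-py's env dict, reusing the collector and cache stand-ins from the sketches above (schedule_reparse is a hypothetical placeholder for the one-time job):

```python
def schedule_reparse(text: str, urls: list[str]) -> None:
    """Hypothetical one-time job: reparse and save once fetches complete."""
    ...


def markdown(text: str, preview: bool = False) -> str:
    # env carries the flag into the plugin; the collector fills embed_urls.
    env = {'embed_preview': preview, 'embed_urls': []}
    html = md.render(text, env)
    pending = [u for u in env['embed_urls'] if u not in _cache]
    if not preview and pending:
        # Keep rescheduling until every link has either an embed code
        # or a failure marker in the cache.
        schedule_reparse(text, pending)
    return html
```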

With this process in mind, my query about cache life applies here too, because this conversion is only required for parsing. I was thinking a life of 14 days for drafts and current content should be fine. If old content is being edited, this should be 48 hours.

Kiran Jonnalagadda, [16-Dec-2022 at 7:55:08 AM]: A network fetch is a view layer activity. It can't be done in utils. However, the parsing is happening in models, which definitely can't reach into the network. There's an architecture mess here that's larger than one Markdown plugin.

IMO, park oEmbed entirely for now until we solve this. JS embed can continue for simple iframe embeds where no API calls are needed.

The thing that's emerging here is that Markdown processing has to be bumped over from model to view, so it's the end of the road for the composite column approach. The view also has to do this in two passes: inline processing produces an AST/token stream, and then a background job updates it with fetched content. But the background job has to know where to save it.

I think we should park this entirely until we get through one of the two major re-org tasks: centralising content in the Document model, and making a mono repo with Docker-based deploys. Both of these will anyway force us to rework the content processing flow.

miteshashar commented 1 year ago

At the current stage, we are only handling this on the client side, for the sources where it is possible to do so.