Google drive document handler

chime3 commented 3 months ago

Closes #85

What's changed

Fetch the file list from Google Drive, filtered by mimeType to PDF, DOCX, and Google documents (docs, spreadsheets, presentations):

q: "mimeType='application/vnd.google-apps.document' or mimeType='application/vnd.google-apps.spreadsheet' or mimeType='application/vnd.google-apps.presentation' or mimeType='application/pdf' or mimeType='application/vnd.openxmlformats-officedocument.wordprocessingml.document'"

Added a DocumentType that matches each mimeType
Parse PDFs and Google Docs only (spreadsheets and presentations are excluded)
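
As a rough illustration of the points above, the query string can be assembled from a list of mimeTypes, and each mimeType can be mapped to a DocumentType. This is only a sketch: `MIME_TO_TYPE`, `buildMimeTypeQuery`, and `toDocumentType` are hypothetical names, and mapping Google-native types to their closest Office type is an assumption, not necessarily what this PR does.

```typescript
// Subset of the DocumentType enum from this PR.
enum DocumentType {
    PDF = "pdf",
    DOCX = "docx",
    XLSX = "xlsx",
    PPTX = "pptx",
    OTHER = "other",
}

// Assumed mapping: Google-native types map to their closest Office type.
const MIME_TO_TYPE: Record<string, DocumentType> = {
    "application/vnd.google-apps.document": DocumentType.DOCX,
    "application/vnd.google-apps.spreadsheet": DocumentType.XLSX,
    "application/vnd.google-apps.presentation": DocumentType.PPTX,
    "application/pdf": DocumentType.PDF,
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": DocumentType.DOCX,
};

// Builds the `q` parameter for drive.files.list from the given mimeTypes.
function buildMimeTypeQuery(mimeTypes: string[]): string {
    return mimeTypes.map((m) => `mimeType='${m}'`).join(" or ");
}

// Falls back to OTHER for any mimeType that is not in the map.
function toDocumentType(mimeType: string): DocumentType {
    return MIME_TO_TYPE[mimeType] ?? DocumentType.OTHER;
}

const q = buildMimeTypeQuery(Object.keys(MIME_TO_TYPE));
```

Keeping the mimeTypes in one map means the query and the type mapping cannot drift apart when a new type is added.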

How to test these changes

Create at least 3 Google documents, then run the test: yarn run test tests/provider/google/gdrive-document.tests.ts

What's pending

export enum DocumentType {
    TXT = "txt", // Done
    PDF = "pdf", // Done
    DOC = "doc",
    DOCX = "docx",
    XLS = "xls",
    XLSX = "xlsx",
    PPT = "ppt",
    PPTX = "pptx",
    OTHER = "other" // zip, mp4????? 
}

We can export Google spreadsheets and presentations as PDF or as corresponding types such as ppt and xls. Do you want to export them as PDF and then parse them?
doc, xls, ppt, etc. are non-Google documents. Do you want to parse them as well?
There are many non-Google document types beyond those in DocumentType. Are you planning to extend the supported file types?
Should we also list non-document files in Google Drive, such as zip and mp4? That is, will the OTHER type include zip, mp4, etc.?

tahpot commented 2 months ago

Instead of downloading the files and parsing them, is it possible to fetch the indexableText for each file? See https://developers.google.com/drive/api/guides/file#indexable-text

According to ChatGPT:

To fetch the indexable text property of a Google Drive file using the Google Drive API, you'll need to use the files.get method with the fields parameter specifying the indexableText field.

We don't want to store the files, we are just extracting text from documents that have text content.

We can export Google spreadsheets and presentations as PDF or as corresponding types such as ppt and xls. Do you want to export them as PDF and then parse them?

Let's not parse spreadsheets. Convert presentations to PDF, then parse them.

doc, xls, ppt, etc. are non-Google documents. Do you want to parse them as well?

Yes to parsing .doc and .ppt.

There are many non-Google document types beyond those in DocumentType. Are you planning to extend the supported file types?

Not at this stage.

Should we also list non-document files in Google Drive, such as zip and mp4? That is, will the OTHER type include zip, mp4, etc.?

I think we should list them and include a link to them, but not parse them.

chime3 commented 2 months ago

Instead of downloading the files and parsing them, is it possible to fetch the indexableText for each file? See https://developers.google.com/drive/api/guides/file#indexable-text

Extracted indexableText from google documents/slides/presentations (PUSHED)

For non-Google documents, we can avoid downloading by converting:

PDF/DOC/DOCX => Google document
PPT/PPTX => Google presentation
XLS/XLSX => Google spreadsheet

and then reading indexableText from the converted documents. However, this requires write permission.

      // Copy the file (doc/docx, ppt/pptx, xls/xlsx) into the matching Google format (Docs/Slides/Sheets)
      const response = await drive.files.copy({
        fileId: fileId,
        requestBody: {
          mimeType: mimeType, // target Google mimeType, e.g. application/vnd.google-apps.document
        },
      });

      const documentId = response.data.id; // ID of the newly created (converted) file

      // From here on, process it like a native Google document

Otherwise, we have to download non-Google documents to parse their content.

Please confirm which approach we should proceed with, @tahpot

chime3 commented 2 months ago

We don't want to store the files, we are just extracting text from documents that have text content.

Yes, we are only extracting the text now, not storing the files.

chime3 commented 2 months ago

Let's not parse spreadsheets. Convert presentations to PDF, then parse them.

We can get indexableText from google spreadsheets now. Do you still not want to parse spreadsheets?

chime3 commented 2 months ago

Yes to parsing .doc and .ppt.

Okay

Not at this stage.

Okay

I think we should list them and include a link to them, but not parse them.

Okay

tahpot commented 2 months ago

We can get indexableText from google spreadsheets now. Do you still not want to parse spreadsheets?

Let's use indexableText for any file that supports it, so yes to spreadsheets.

tahpot commented 2 months ago

Instead of downloading the files and parsing them, is it possible to fetch the indexableText for each file? See https://developers.google.com/drive/api/guides/file#indexable-text

Extracted indexableText from google documents/slides/presentations (PUSHED)

For non-Google documents, we can avoid downloading by converting:

PDF/DOC/DOCX => Google document

PPT/PPTX => Google presentation

XLS/XLSX => Google spreadsheet

and then reading indexableText from the converted documents. However, this requires write permission.

Let's download and parse. Add an option in the handler for maxFileSizeToIndex with default = 10MB.
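
A minimal sketch of how such an option could be applied before downloading. Only the name `maxFileSizeToIndex` and the 10MB default come from this thread; `DriveHandlerOptions`, `DriveFileEntry`, and `shouldIndex` are hypothetical, and reading 10MB as 10 * 1024 * 1024 bytes is an assumption.

```typescript
// Hypothetical handler options; only maxFileSizeToIndex comes from the discussion.
interface DriveHandlerOptions {
    maxFileSizeToIndex: number; // in bytes
}

// Assumption: 10MB is interpreted as 10 * 1024 * 1024 bytes.
const DEFAULT_OPTIONS: DriveHandlerOptions = {
    maxFileSizeToIndex: 10 * 1024 * 1024,
};

// Minimal shape of a Drive file entry. The Drive API returns `size` as a
// string, and it may be absent for entries that have no size.
interface DriveFileEntry {
    id: string;
    size?: string;
}

// Returns true when the file is small enough to download and parse.
function shouldIndex(
    file: DriveFileEntry,
    options: Partial<DriveHandlerOptions> = {}
): boolean {
    const { maxFileSizeToIndex } = { ...DEFAULT_OPTIONS, ...options };
    if (file.size === undefined) {
        return true; // size unknown: allowed in this sketch
    }
    return Number(file.size) <= maxFileSizeToIndex;
}
```

Checking the size from the file listing avoids starting a download that would be discarded anyway.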

tahpot commented 2 months ago

One more change.

Please switch from using the document schema, to using this new file schema: https://common.schemas.verida.io/file/v0.1.0/schema.json

For google documents, fileDataId should always be undefined, but uri should always be defined.

chime3 commented 2 months ago

One more change.

Please switch from using the document schema, to using this new file schema: https://common.schemas.verida.io/file/v0.1.0/schema.json

For google documents, fileDataId should always be undefined, but uri should always be defined.

Updated the schema, and the unit tests pass. I also checked the Google Drive data using the web interface. gdrive-tests

chime3 commented 2 months ago

One more thing about extracting the extension, @tahpot

First, extract the extension from the file name using the period. This can cause errors, because a file name may contain periods that are not extension separators.
If the extension cannot be determined from the file name, use the mimeType to determine it.

We cannot recognize every file extension. Please clarify whether we should use type constraints or not.
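
One possible shape for this, as a sketch only: trust the file-name suffix only when it is a known extension, otherwise fall back to the mimeType. `KNOWN_EXTENSIONS`, `MIME_TO_EXTENSION`, and `resolveExtension` are hypothetical names, and restricting to the DocumentType values is an assumption pending the answer to the question above.

```typescript
// Assumed set of recognised extensions, taken from the DocumentType values.
const KNOWN_EXTENSIONS = new Set([
    "txt", "pdf", "doc", "docx", "xls", "xlsx", "ppt", "pptx",
]);

// Assumed mimeType fallbacks; extend as needed.
const MIME_TO_EXTENSION: Record<string, string> = {
    "text/plain": "txt",
    "application/pdf": "pdf",
    "application/msword": "doc",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
};

function resolveExtension(fileName: string, mimeType: string): string {
    const dot = fileName.lastIndexOf(".");
    // Ignore a leading dot (".gitignore") and a trailing dot ("name.").
    if (dot > 0 && dot < fileName.length - 1) {
        const candidate = fileName.slice(dot + 1).toLowerCase();
        // Only trust the suffix if it is a known extension, so a name like
        // "report.v2" is not treated as having the extension "v2".
        if (KNOWN_EXTENSIONS.has(candidate)) {
            return candidate;
        }
    }
    // Fall back to the mimeType, then to "other".
    return MIME_TO_EXTENSION[mimeType] ?? "other";
}
```

With this approach, unrecognised suffixes end up as "other" rather than being stored as arbitrary strings.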

tahpot commented 2 months ago

Feedback:

Can confirm tests pass
The Google Slides API was not enabled, so I have enabled it. Will we have the same issue with other types, such as drawings or sheets?
When running the sync on my Gmail account, I receive this error in the sync logs and no files are saved (Export only supports Docs Editors files.). This Stack Overflow post indicates files.get() may need to be used instead?
The failure of one item shouldn't break the whole sync process, so I'm not sure what is preventing any items from being saved.
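
The last two points could be sketched roughly as follows. files.export only works for Docs Editors (Google-native) files, while binary files such as PDF or DOCX are downloaded with files.get and alt=media. The names `downloadMethod` and `syncItems` and the `fetchText` callback are hypothetical stand-ins for the real handler code, which may be structured differently.

```typescript
interface DriveItem {
    id: string;
    mimeType: string;
}

// Google-native (Docs Editors) files use this mimeType prefix.
const GOOGLE_APPS_PREFIX = "application/vnd.google-apps.";

// Chooses between files.export (Google-native) and files.get (binary files).
function downloadMethod(mimeType: string): "export" | "get" {
    return mimeType.startsWith(GOOGLE_APPS_PREFIX) ? "export" : "get";
}

// `fetchText` is a hypothetical callback standing in for the real Drive calls.
async function syncItems(
    items: DriveItem[],
    fetchText: (item: DriveItem, method: "export" | "get") => Promise<string>
): Promise<{ saved: string[]; failed: string[] }> {
    const saved: string[] = [];
    const failed: string[] = [];
    for (const item of items) {
        try {
            await fetchText(item, downloadMethod(item.mimeType));
            saved.push(item.id);
        } catch {
            // Record the failure and continue with the next item.
            failed.push(item.id);
        }
    }
    return { saved, failed };
}
```

Folders and shortcuts also use the `application/vnd.google-apps.` prefix but cannot be exported, so they would need to be filtered out before this step.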

chime3 commented 2 months ago

What has been changed?

Replaced pptx-parser with officeparser
Skipped parsing doc and ppt: recent parsing libraries no longer support these legacy formats, so we would have to use libreoffice-convert to convert doc/ppt into docx/pptx first, and libreoffice-convert requires LibreOffice to be pre-installed.

chime3 commented 2 months ago

Feedback:

Can confirm tests pass

The Google Slides API was not enabled, so I have enabled it. Will we have the same issue with other types, such as drawings or sheets?

When running the sync on my Gmail account, I receive this error in the sync logs and no files are saved (Export only supports Docs Editors files.). This Stack Overflow post indicates files.get() may need to be used instead?

The failure of one item shouldn't break the whole sync process, so I'm not sure what is preventing any items from being saved.

All done

chime3 commented 2 months ago

All good to merge once the handler is updated to handle the last page fix in the other google handlers.

Okay, thank you

verida / data-connector-server