verida / data-connector-server

Google drive document handler #86

Closed: chime3 closed this 2 months ago

chime3 commented 3 months ago

Closes #85

What's changed

How to test these changes

What's pending

export enum DocumentType {
    TXT = "txt", // Done
    PDF = "pdf", // Done
    DOC = "doc",
    DOCX = "docx",
    XLS = "xls",
    XLSX = "xlsx",
    PPT = "ppt",
    PPTX = "pptx",
    OTHER = "other" // zip, mp4? (to be confirmed)
}
tahpot commented 2 months ago

Instead of downloading the files and parsing them, is it possible to fetch the indexableText for each file? See https://developers.google.com/drive/api/guides/file#indexable-text

According to ChatGPT:

> To fetch the indexable text property of a Google Drive file using the Google Drive API, you'll need to use the files.get method with the fields parameter specifying the indexableText field.


We don't want to store the files, we are just extracting text from documents that have text content.

> We can export Google spreadsheets and presentations as PDF or other corresponding types such as ppt, xls, etc. Do you want us to export them as PDF and parse that?

Let's not parse spreadsheets. Convert presentations to PDF, then parse them.

> doc, xls, ppt, etc. are non-Google documents. Do you want to parse them as well?

Yes to parsing .doc and .ppt.

> There are many non-Google document types beyond those in DocumentType. Are you planning to extend the supported file types?

Not at this stage.

> Should we also list non-document files in Google Drive, such as zip and mp4? That is, will the OTHER type include zip, mp4, etc.?

I think we should list them and include a link to them, but not parse them.
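A minimal sketch of how the "list but don't parse" rule for OTHER could look, assuming the type is derived from the file name's extension. The helpers `documentTypeFromName` and `isListOnly` and the lookup table are hypothetical and not taken from the PR; the spreadsheet decision is deliberately left out.

```typescript
// Mirrors the DocumentType enum from the PR description.
enum DocumentType {
    TXT = "txt",
    PDF = "pdf",
    DOC = "doc",
    DOCX = "docx",
    XLS = "xls",
    XLSX = "xlsx",
    PPT = "ppt",
    PPTX = "pptx",
    OTHER = "other",
}

// Hypothetical lookup table: lower-case extension -> DocumentType.
const EXTENSION_TO_TYPE: { [ext: string]: DocumentType } = {
    txt: DocumentType.TXT,
    pdf: DocumentType.PDF,
    doc: DocumentType.DOC,
    docx: DocumentType.DOCX,
    xls: DocumentType.XLS,
    xlsx: DocumentType.XLSX,
    ppt: DocumentType.PPT,
    pptx: DocumentType.PPTX,
};

// Hypothetical helper: anything without a recognised extension becomes OTHER.
function documentTypeFromName(fileName: string): DocumentType {
    const dot = fileName.lastIndexOf(".");
    // No dot, a leading dot only (".bashrc"), or a trailing dot: no usable extension.
    if (dot <= 0 || dot === fileName.length - 1) {
        return DocumentType.OTHER;
    }
    const ext = fileName.slice(dot + 1).toLowerCase();
    // hasOwnProperty guards against inherited keys such as "constructor".
    return Object.prototype.hasOwnProperty.call(EXTENSION_TO_TYPE, ext)
        ? EXTENSION_TO_TYPE[ext]
        : DocumentType.OTHER;
}

// Hypothetical helper: OTHER files (zip, mp4, ...) are listed with a link but not parsed.
function isListOnly(type: DocumentType): boolean {
    return type === DocumentType.OTHER;
}
```

With this shape, unrecognised extensions never fail the sync; they simply end up as OTHER entries that carry a link.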

chime3 commented 2 months ago

> Instead of downloading the files and parsing them, is it possible to fetch the indexableText for each file? See https://developers.google.com/drive/api/guides/file#indexable-text

Extracted indexableText from Google documents/slides/presentations (pushed).

For non-Google documents, to avoid downloading, we can convert

  • PDF/DOC/DOCX => Google document
  • PPT/PPTX => Google presentation
  • XLS/XLSX => Google spreadsheet

and then get indexableText from the converted documents. But this requires write permission.

      // Convert the file (doc/docx, ppt/pptx, xls/xlsx) to the matching Google Docs/Slides/Sheets format.
      // `drive`, `fileId` and `mimeType` (the target Google MIME type) come from the surrounding handler.
      const response = await drive.files.copy({
        fileId,
        requestBody: {
          mimeType, // target Google Docs/Slides/Sheets MIME type
        },
      });

      const documentId = response.data.id; // ID of the newly created (converted) copy

      // Then process the copy like a native Google document
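For illustration, the `mimeType` passed to the copy call above could be chosen from a lookup like the following. The MIME type strings are the standard ones for these formats; the table and the `conversionTarget` helper are hypothetical, and whether Drive actually performs each conversion on copy has not been verified here. Note that the thread later settles on downloading and parsing instead.

```typescript
// Google Workspace target MIME types.
const GOOGLE_DOCUMENT = "application/vnd.google-apps.document";
const GOOGLE_PRESENTATION = "application/vnd.google-apps.presentation";
const GOOGLE_SPREADSHEET = "application/vnd.google-apps.spreadsheet";

// Hypothetical table: source MIME type -> Google format to convert to.
const CONVERSION_TARGETS: { [sourceMimeType: string]: string } = {
    // PDF/DOC/DOCX => Google document
    "application/pdf": GOOGLE_DOCUMENT,
    "application/msword": GOOGLE_DOCUMENT,
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": GOOGLE_DOCUMENT,
    // PPT/PPTX => Google presentation
    "application/vnd.ms-powerpoint": GOOGLE_PRESENTATION,
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": GOOGLE_PRESENTATION,
    // XLS/XLSX => Google spreadsheet
    "application/vnd.ms-excel": GOOGLE_SPREADSHEET,
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": GOOGLE_SPREADSHEET,
};

// Hypothetical helper: returns undefined when no conversion is defined for the source type.
function conversionTarget(sourceMimeType: string): string | undefined {
    return Object.prototype.hasOwnProperty.call(CONVERSION_TARGETS, sourceMimeType)
        ? CONVERSION_TARGETS[sourceMimeType]
        : undefined;
}
```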

Otherwise, we need to download non-Google documents to parse their content.

Please confirm which approach we should proceed with, @tahpot.

chime3 commented 2 months ago

> We don't want to store the files, we are just extracting text from documents that have text content.

Yes, we are now extracting text, not storing files.

chime3 commented 2 months ago

> Let's not parse spreadsheets. Convert presentations to PDF, then parse them.

We can get indexableText from Google spreadsheets now. Do you still not want to parse spreadsheets?

chime3 commented 2 months ago

> Yes to parsing .doc and .ppt.

Okay

> Not at this stage.

Okay

> I think we should list them and include a link to them, but not parse them.

Okay

tahpot commented 2 months ago

> We can get indexableText from Google spreadsheets now. Do you still not want to parse spreadsheets?

Let's use indexableText for any file that supports it, so yes to spreadsheets.

tahpot commented 2 months ago

> > Instead of downloading the files and parsing them, is it possible to fetch the indexableText for each file? See https://developers.google.com/drive/api/guides/file#indexable-text
>
> Extracted indexableText from Google documents/slides/presentations (pushed).
>
> For non-Google documents, to avoid downloading, we can convert
>
>   • PDF/DOC/DOCX => Google document
>   • PPT/PPTX => Google presentation
>   • XLS/XLSX => Google spreadsheet
>
> and then get indexableText from the converted documents. But this requires write permission.

Let's download and parse. Add an option in the handler for maxFileSizeToIndex with default = 10MB.
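A rough sketch of what such an option could look like. Only the option name `maxFileSizeToIndex` and the 10MB default come from the comment above; the interface name, the `shouldIndexContent` helper, the choice of 10 * 1024 * 1024 bytes for "10MB", and skipping files of unknown size are all assumptions.

```typescript
// Hypothetical options shape; only maxFileSizeToIndex is named in the thread.
interface DriveHandlerOptions {
    // Largest file (in bytes) that will be downloaded and parsed.
    maxFileSizeToIndex?: number;
}

// Assumption: "10MB" is interpreted as 10 MiB.
const DEFAULT_MAX_FILE_SIZE_TO_INDEX = 10 * 1024 * 1024;

// Hypothetical helper: decide whether a file is small enough to download and parse.
// The Drive API reports `size` as a string, so both strings and numbers are accepted.
function shouldIndexContent(
    size: string | number | undefined,
    options: DriveHandlerOptions = {}
): boolean {
    const limit = options.maxFileSizeToIndex ?? DEFAULT_MAX_FILE_SIZE_TO_INDEX;
    const bytes = typeof size === "string" ? parseInt(size, 10) : size;
    // Assumption: a missing or unparseable size means the file is skipped.
    if (bytes === undefined || !isFinite(bytes) || bytes < 0) {
        return false;
    }
    return bytes <= limit;
}
```

Files over the limit would still be listed; only the download-and-parse step is skipped for them.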

tahpot commented 2 months ago

One more change.

Please switch from using the document schema, to using this new file schema: https://common.schemas.verida.io/file/v0.1.0/schema.json

For google documents, fileDataId should always be undefined, but uri should always be defined.
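The constraint above could be checked with something like the following sketch. Only the field names `fileDataId` and `uri` come from the comment; the `FileRecordLike` type and the `googleFileRecordErrors` helper are hypothetical, and the rest of the file schema is not modelled.

```typescript
// Hypothetical, partial view of a record in the file schema.
interface FileRecordLike {
    fileDataId?: string;
    uri?: string;
}

// Hypothetical helper: returns the list of violated rules (empty when the record is fine).
function googleFileRecordErrors(record: FileRecordLike): string[] {
    const errors: string[] = [];
    if (record.fileDataId !== undefined) {
        errors.push("fileDataId must be undefined for Google documents");
    }
    if (record.uri === undefined || record.uri === "") {
        errors.push("uri must be defined for Google documents");
    }
    return errors;
}
```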

chime3 commented 2 months ago

> One more change.
>
> Please switch from using the document schema, to using this new file schema: https://common.schemas.verida.io/file/v0.1.0/schema.json
>
> For google documents, fileDataId should always be undefined, but uri should always be defined.

Updated the schema, and the unit tests pass. I also checked the Google Drive data using the web interface. [Image: gdrive-tests]

chime3 commented 2 months ago

One more thing about extracting the file extension, @tahpot:

We cannot recognize all file extensions. Please clarify whether we should use type constraints or not.

tahpot commented 2 months ago

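Feedback:

  • Can confirm tests pass
  • There was an issue with Google Slides API not being enabled, so I have enabled it. Will we have the same issue with other types like drawings or sheets?
  • When running the sync on my Gmail account, I receive this error in the sync logs and no files are saved: "Export only supports Docs Editors files." A Stack Overflow answer indicates files.get() may need to be used instead?
  • The failure of one item shouldn't break the whole sync process, so I'm not sure what is happening to prevent any items being saved.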

chime3 commented 2 months ago

What has been changed?

chime3 commented 2 months ago

> Feedback:
>
>   • Can confirm tests pass
>   • There was an issue with Google Slides API not being enabled, so I have enabled it. Will we have the same issue with other types like drawings or sheets?
>   • When running the sync on my Gmail account, I receive this error in the sync logs and no files are saved: "Export only supports Docs Editors files." A Stack Overflow answer indicates files.get() may need to be used instead?
>   • The failure of one item shouldn't break the whole sync process, so I'm not sure what is happening to prevent any items being saved.

All done
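The last two feedback points can be illustrated with a small sketch. It assumes that export is only attempted for Docs Editors files (recognised here by the `application/vnd.google-apps.` MIME prefix, excluding folders) and that each item is processed inside its own try/catch. The names `isDocsEditorsFile`, `syncItems` and `SyncResult` are hypothetical and this is not the handler's actual code.

```typescript
const DOCS_EDITORS_PREFIX = "application/vnd.google-apps.";
const FOLDER_MIME_TYPE = "application/vnd.google-apps.folder";

// Hypothetical helper: true for files where an export call is plausible;
// other files would be fetched with files.get() instead.
function isDocsEditorsFile(mimeType: string): boolean {
    return (
        mimeType.slice(0, DOCS_EDITORS_PREFIX.length) === DOCS_EDITORS_PREFIX &&
        mimeType !== FOLDER_MIME_TYPE
    );
}

interface SyncResult<R> {
    saved: R[];
    errors: { id: string; message: string }[];
}

// Hypothetical sync loop: one failing item is recorded and skipped,
// and the remaining items are still processed and saved.
async function syncItems<T extends { id: string }, R>(
    items: T[],
    processItem: (item: T) => Promise<R>
): Promise<SyncResult<R>> {
    const result: SyncResult<R> = { saved: [], errors: [] };
    for (const item of items) {
        try {
            result.saved.push(await processItem(item));
        } catch (err) {
            const message = err instanceof Error ? err.message : String(err);
            result.errors.push({ id: item.id, message });
        }
    }
    return result;
}
```

Collecting the errors rather than rethrowing keeps the sync log informative while still saving every item that succeeded.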

chime3 commented 2 months ago

> All good to merge once the handler is updated to handle the last page fix in the other Google handlers.

Okay, thank you