Enable pyfaidx to accept gcp paths for compressed fasta files

archanaraja commented 4 years ago

Hi, I'm currently getting an error when loading compressed FASTA files from GCP paths. It would be great if this feature were enabled, as it is in pandas. Are there any plans to add it in the near future? Thanks, Archana

mdshw5 commented 4 years ago

Hey @archanaraja, thanks for raising this issue, and apologies for the late response. I'm not familiar with the Google Cloud Storage APIs and was not planning to implement this. If you can describe your use case in a bit more detail I may be able to help. It looks like Google's Python package implements byte ranges (https://googleapis.dev/python/storage/latest/client.html#google.cloud.storage.client.Client.download_blob_to_file), so assuming the FASTA index file is present I can imagine pyfaidx reading the FAI into memory and then making calls to GCP for the specific sequences we need. If the FASTA is not indexed, then pyfaidx would need to stream the entire FASTA and produce an index. That's not too efficient, and it also raises the question of what to do with the newly created FASTA index: do we store it somewhere for re-use, or do we rebuild it from scratch the next time we initialize?
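For illustration, the index-then-byte-range idea described above could look roughly like the sketch below. This is not pyfaidx code. The names `FaiEntry`, `parse_fai`, `byte_offset` and `fetch` are hypothetical, and `read_range` stands in for whatever performs the ranged read; with the `google-cloud-storage` package that could plausibly be `Blob.download_as_bytes(start=..., end=...)`. It assumes an uncompressed FASTA with a standard five-column `.fai` file, since plain byte offsets do not map onto a gzip-compressed file.

```python
from typing import Callable, Dict, NamedTuple


class FaiEntry(NamedTuple):
    """One record of a .fai index (hypothetical helper, not a pyfaidx class)."""
    length: int      # number of bases in the sequence
    offset: int      # byte offset of the first base in the FASTA file
    linebases: int   # bases per full line
    linewidth: int   # bytes per full line, including the line terminator


def parse_fai(text: str) -> Dict[str, FaiEntry]:
    """Parse the tab-separated contents of a .fai file into a dict by name."""
    index = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, length, offset, linebases, linewidth = line.split("\t")[:5]
        index[name] = FaiEntry(int(length), int(offset),
                               int(linebases), int(linewidth))
    return index


def byte_offset(entry: FaiEntry, pos: int) -> int:
    """Byte position in the file of the 0-based base `pos`."""
    full_lines, within = divmod(pos, entry.linebases)
    return entry.offset + full_lines * entry.linewidth + within


def fetch(index: Dict[str, FaiEntry],
          read_range: Callable[[int, int], bytes],
          name: str, start: int, end: int) -> str:
    """Return bases [start, end) of `name` using a single ranged read.

    `read_range(first, last)` is an assumed callable that returns the bytes
    from `first` to `last`, both inclusive.
    """
    entry = index[name]
    if not 0 <= start < end <= entry.length:
        raise ValueError("requested interval is outside the sequence")
    first = byte_offset(entry, start)
    last = byte_offset(entry, end - 1)
    raw = read_range(first, last)
    # Line terminators inside the range are not part of the sequence.
    return raw.decode("ascii").replace("\n", "").replace("\r", "")
```

Only the small `.fai` file is read in full; each sequence request then costs one ranged read, which is what makes the approach attractive when the FASTA lives in a bucket.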

archanaraja commented 4 years ago

Thank you for the prompt response. I was trying to use the kit to compute the lengths of the FASTQs in a file as a QC step. Since that requires an index and the files are not indexed, it looks like there is no easy fix. All my data is on GCP, and the pandas package can read files directly from GCP, so it looks like I should use byte ranges, as you suggested, to make faidx work.

Thanks for the detailed explanation.

Archana
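If only sequence lengths are needed for QC, one option that avoids an index altogether is a single streaming pass over the file. The sketch below is an assumption about how that might be done, not a pyfaidx feature: `fasta_lengths` is a hypothetical helper, it handles FASTA-style records only (FASTQ would need its own four-line parsing), and it accepts any iterable of text lines, for example a handle from `gzip.open(path, "rt")` or a text-mode file object supplied by a cloud storage client.

```python
from typing import Dict, Iterable


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {sequence name: length} from one pass over FASTA text lines."""
    lengths: Dict[str, int] = {}
    name = None
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # The name is the first whitespace-delimited token of the header.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line.strip())
    return lengths
```

This still reads the whole file once, which is the inefficiency mentioned earlier in the thread, but it keeps only one counter per sequence in memory and leaves no index file to store.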

mdshw5 / pyfaidx

Enable pyfaidx to accept gcp paths for compressed fasta files #161