ml-struct-bio / cryodrgn

Neural networks for cryo-EM reconstruction
http://cryodrgn.cs.princeton.edu
GNU General Public License v3.0

Support .cs file writing for export to cryoSPARC #150

Open zhonge opened 1 year ago

zhonge commented 1 year ago

We should have a tool cryodrgn_utils write_cs to streamline re-importing particles to cryoSPARC.

If the input to cryoDRGN originated from cryoSPARC, this tool could keep certain (all?) fields like the uid from a reference .cs file. Then, the reimport to cryoSPARC wouldn't define an entirely new dataset in their database.

I'm not 100% sure what we would need to implement yet, but maybe we can refine the idea in this thread.

zhonge commented 1 year ago

Related to issues #72, #101, #148.

zhonge commented 1 year ago

One additional complexity is that particles are typically downsampled before training cryoDRGN, so the question is whether the .cs file should point to the (new) downsampled particle stack or to the original extracted particles (more error-prone).

In the latter case, the information that cryoDRGN provides is an index filtering. So maybe it makes sense to have a cryodrgn_utils filter_cs tool instead.
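
To make the "index filtering" idea concrete, here is a minimal sketch, not the actual cryodrgn_utils implementation. It relies on .cs files being NumPy structured arrays, so a selection is just row indexing and every field (including uid) is carried over unchanged. The function name `filter_cs`, the file names, and the synthetic field layout are hypothetical and only for illustration.

```python
import os
import pickle
import tempfile

import numpy as np


def filter_cs(cs_in, ind_pkl, cs_out):
    """Keep only the rows of a .cs file selected by a pickled index array."""
    particles = np.load(cs_in)  # a .cs file loads as a NumPy structured array
    with open(ind_pkl, "rb") as f:
        ind = np.asarray(pickle.load(f))
    filtered = particles[ind]  # row selection keeps all fields, uid included
    # Write through a file handle: given a path, np.save would append ".npy".
    with open(cs_out, "wb") as f:
        np.save(f, filtered)
    return filtered


# Synthetic stand-in for an exported .cs file (illustrative field layout).
dtype = [("uid", "<u8"), ("blob/path", "S32"), ("blob/idx", "<u4")]
original = np.zeros(5, dtype=dtype)
original["uid"] = [101, 102, 103, 104, 105]
original["blob/path"] = b"J1/extract/stack.mrc"
original["blob/idx"] = np.arange(5)

workdir = tempfile.mkdtemp()
cs_in = os.path.join(workdir, "particles_exported.cs")
ind_pkl = os.path.join(workdir, "ind_keep.pkl")
cs_out = os.path.join(workdir, "particles_filtered.cs")

with open(cs_in, "wb") as f:
    np.save(f, original)
with open(ind_pkl, "wb") as f:
    pickle.dump(np.array([0, 2, 3]), f)

filtered = filter_cs(cs_in, ind_pkl, cs_out)
print(filtered["uid"])  # [101 103 104]
```

Because the rows are copied as-is, the blob/path entries still point at the original extracted particles rather than at a downsampled stack.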

Guillawme commented 1 year ago

Being able to send a selection of particles back to cryoSPARC would be so useful!

And I think it is best as only a selection, pointing to the original particles in the cryoSPARC project (not re-importing the downsampled particles), because the typical use case is to refine a subset of particles to high resolution, using the particles at their original pixel size.

Guillawme commented 1 year ago

Related to issues https://github.com/zhonge/cryodrgn/issues/72, https://github.com/zhonge/cryodrgn/issues/101, https://github.com/zhonge/cryodrgn/issues/148.

Also #109

vineetbansal commented 1 year ago

Just recapping some salient points from discussions with @zhonge about this:

A corresponding write_cs command will be implemented exactly as outlined above, except that it will take in either an input .cs file or an input .mrcs/.txt file. The command will mimic the behavior above, except that for .cs files.

A couple of additional points I'd like to propose here:

Guillawme commented 1 year ago

Seems like a great plan!

Sorry to bring this up again (https://github.com/zhonge/cryodrgn/pull/70#issuecomment-941061328). Maybe you already discussed it and chose not to use an external library, but just in case you didn't: check out the starfile library. Its goal is compatibility with star files from RELION, and it might simplify all your handling of star files (it turns star files into pandas dataframes and vice versa).

vineetbansal commented 1 year ago

Hi @Guillawme - I agree on using the starfile library. However, I'd like to handle that as a separate issue so the migration can be independent of anything we do here. I'll create an issue on that and hopefully we can get it done quickly.

olibclarke commented 1 year ago

I think write_cs will also need to write a .csg file, right? In order to be able to import into Csparc using "Import Result Group"?

e.g. something like this - simple metadata file (see here for details):


```
group:
  description: A stack of imported particles. May or may not contain data, ctfs, pick
    locations, etc.
  name: imported_particles
  title: Imported particles
  type: particle
results:
  blob:
    metafile: '>J4539_imported_particles_exported.cs'
    num_items: 1616
    type: particle.blob
  ctf:
    metafile: '>J4539_imported_particles_exported.cs'
    num_items: 1616
    type: particle.ctf
version: v4.0.1
```
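
A rough sketch of how a file shaped like the one above could be generated from a .cs array. It assumes that .cs field names follow a `prefix/field` pattern (e.g. `blob/path`) and that each prefix maps to one entry under `results`. The helper names `result_names` and `make_csg_text` are hypothetical, the group name and title are copied from the example, and whether cryoSPARC accepts the output has not been tested here.

```python
import numpy as np


def result_names(particles):
    """Collect the distinct prefixes of 'prefix/field' column names, in order."""
    names = []
    for field in particles.dtype.names:
        prefix, sep, _ = field.partition("/")
        if sep and prefix not in names:  # fields without '/' (e.g. uid) are skipped
            names.append(prefix)
    return names


def make_csg_text(particles, cs_name, version):
    """Build .csg-style text with one result entry per field prefix."""
    lines = [
        "group:",
        "  name: imported_particles",
        "  title: Imported particles",
        "  type: particle",
        "results:",
    ]
    for name in result_names(particles):
        lines += [
            f"  {name}:",
            f"    metafile: '>{cs_name}'",
            f"    num_items: {len(particles)}",
            f"    type: particle.{name}",
        ]
    lines.append(f"version: {version}")
    return "\n".join(lines) + "\n"


# Synthetic particle table; the field names are assumptions for illustration.
demo = np.zeros(
    3,
    dtype=[("uid", "<u8"), ("blob/path", "S16"), ("blob/idx", "<u4"), ("ctf/df1_A", "<f4")],
)
csg_text = make_csg_text(demo, "particles_filtered.cs", "v4.0.1")
print(csg_text)
```

Plain string formatting is used instead of a YAML library only to keep the sketch dependency-free.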
zhonge commented 1 year ago

Here's how to reimport a particle stack filtered by cryoDRGN back into cryoSPARC, while pointing to the original particles in the cryoSPARC project.

  1. Export the original cryoSPARC particles from the associated Job so that there's a single .cs and .csg file describing the particle stack (i.e. no PXXX_JYYY_passthrough_particles.cs file). You can do this with the "Export" button in the Outputs tab.

  2. You'll find the .cs and .csg files in the exports subdirectory of your project directory: /path/to/project/directory/PXXX/exports/groups/JYYY_particles

  3. Filter the .cs file with the index selection .pkl file using cryodrgn_utils write_cs. For example, here is the command to filter J929_particles_exported.cs by a selection saved in ind_keep.214511_particles.pkl and save out a new J929_particles_filtered.cs file:

```
(cryodrgn) $ cryodrgn_utils write_cs J929_particles_exported.cs --ind ind_keep.214511_particles.pkl -o J929_particles_filtered.cs
```
  4. Make a copy of the .csg text file and replace the metafile field with the new .cs filename and the num_items field with the new number of particles. Here's a comparison of the before and after:

```
(cryodrgn) [Sat Mar 11 23:50 J929_particles] sdiff J929_particles_exported.csg J929_particles_filtered.csg
created: 2023-03-12 03:52:35.411011                              created: 2023-03-12 03:52:35.411011
group:                                                           group:
  description: All particles that were processed, including a      description: All particles that were processed, including a
  name: particles                                                  name: particles
  title: All particles                                             title: All particles
  type: particle                                                   type: particle
results:                                                         results:
  alignments2D:                                                    alignments2D:
    metafile: '>J929_particles_exported.cs'                    |     metafile: '>J929_particles_filtered.cs'
    num_items: 286801                                          |     num_items: 214511
    type: particle.alignments2D                                      type: particle.alignments2D
  alignments3D:                                                    alignments3D:
    metafile: '>J929_particles_exported.cs'                    |     metafile: '>J929_particles_filtered.cs'
    num_items: 286801                                          |     num_items: 214511
    type: particle.alignments3D                                      type: particle.alignments3D
  blob:                                                            blob:
    metafile: '>J929_particles_exported.cs'                    |     metafile: '>J929_particles_filtered.cs'
    num_items: 286801                                          |     num_items: 214511
    type: particle.blob                                              type: particle.blob
  ctf:                                                             ctf:
    metafile: '>J929_particles_exported.cs'                    |     metafile: '>J929_particles_filtered.cs'
    num_items: 286801                                          |     num_items: 214511
    type: particle.ctf                                               type: particle.ctf
  location:                                                        location:
    metafile: '>J929_particles_exported.cs'                    |     metafile: '>J929_particles_filtered.cs'
    num_items: 286801                                          |     num_items: 214511
    type: particle.location                                          type: particle.location
  pick_stats:                                                      pick_stats:
    metafile: '>J929_particles_exported.cs'                    |     metafile: '>J929_particles_filtered.cs'
    num_items: 286801                                          |     num_items: 214511
    type: particle.pick_stats                                        type: particle.pick_stats
version: v4.1.2                                                  version: v4.1.2
```
  5. In cryoSPARC, use the "Import Result Group" job type and reimport the new .csg file. :tada:

We can probably have cryodrgn_utils write_cs write out the .csg file as well, as @olibclarke suggested, so that Step 4 can be skipped. It may be worth looking at the new cryosparc-tools API to see if there's a better way to write out the .csg file.
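
Until then, the manual .csg edit could be scripted roughly as below. This is a sketch under the assumption, consistent with the sdiff above, that every result in the exported .csg shares one metafile and one num_items value. The function name `retarget_csg` is hypothetical and this is not part of cryodrgn_utils.

```python
import re


def retarget_csg(csg_text, new_cs_name, num_items):
    """Point every result in .csg text at a new .cs file and particle count."""
    # Keep the leading '>' inside the quotes exactly as in the exported file.
    text = re.sub(
        r"(metafile:\s*)'>[^']*'",
        lambda m: f"{m.group(1)}'>{new_cs_name}'",
        csg_text,
    )
    # Replace only the digits after num_items, leaving indentation untouched.
    text = re.sub(
        r"(num_items:\s*)\d+",
        lambda m: f"{m.group(1)}{num_items}",
        text,
    )
    return text


# Excerpt of the exported .csg shown in the comparison above.
example = """\
results:
  blob:
    metafile: '>J929_particles_exported.cs'
    num_items: 286801
    type: particle.blob
  ctf:
    metafile: '>J929_particles_exported.cs'
    num_items: 286801
    type: particle.ctf
version: v4.1.2
"""

patched = retarget_csg(example, "J929_particles_filtered.cs", 214511)
print(patched)
```

Editing the text in place, rather than parsing and re-dumping the YAML, leaves all other fields (created, group, version) byte-for-byte identical to the export.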

Guillawme commented 11 months ago

I just tried this and it seems it is going to work! :tada: (The file was generated, but I will know for sure when I'm able to copy these newly generated .csg and .cs files to the correct location; on our cluster we don't have write permission to the cryosparc project directory, but cryosparc will only import result groups from there, so I need somebody else to copy the files for me or change permissions.)

It would be great for usability if this tool could work this way (merging steps 3 and 4 above, as you say):