pachterlab / seqspec

machine-readable file format for genomic library sequence and structure
MIT License

seqspec for parse wt-mega v2 kit #9

Closed detrout closed 1 year ago

detrout commented 1 year ago

This is my attempt at a seqspec for the NextSeq and NovaSeq reads we generated with the Parse Biosciences WT Mega v2 kit.

As far as I can tell it validates correctly.

detrout commented 1 year ago

I rebased my parse-wt commit onto your current head so it should work.

detrout commented 1 year ago

Oh, it occurred to me: should this be in the SPLiT-seq directory as a different version? I wasn't completely sure whether SPLiT-seq was the v1 protocol.

sbooeshaghi commented 1 year ago

Hi Diane, could you take a look at this PR and modify the parse spec and files accordingly? The changes mostly have to do with selecting one million sequencing reads and uploading them:

https://github.com/IGVF/seqspec/pull/12

thank you!
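
For reference, selecting the first one million reads from a gzipped FASTQ can be sketched as below. This is only an illustration and not necessarily how the linked PR does it: the function name `head_fastq` and the file names in the usage note are hypothetical, and the code assumes standard four-line FASTQ records.

```python
import gzip
from itertools import islice


def head_fastq(src_path, dst_path, n_reads=1_000_000):
    """Copy the first n_reads records of a gzipped FASTQ into a new gzipped file.

    Assumes every record is exactly four lines (header, sequence, '+', quality).
    Returns the number of complete records written.
    """
    lines_written = 0
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        # islice stops after 4 * n_reads lines, or earlier if the file is shorter.
        for line in islice(src, 4 * n_reads):
            dst.write(line)
            lines_written += 1
    return lines_written // 4
```

A call such as `head_fastq("full_R1.fastq.gz", "R1.fastq.gz")` would write the subset. For paired reads, the same `n_reads` should be used on every file so that mates stay in the same order.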

detrout commented 1 year ago

OK, I tried to update the pull request with the example FASTQ files where you asked.

Though I do wonder whether, instead of just committing the fastq.gz files, we should switch to git-lfs for these large files.

sbooeshaghi commented 1 year ago

Hi Diane, could you please run seqspec check and verify that the spec has no errors?

detrout commented 1 year ago

Sorry, I forgot about this for a while.

I just ran it against 1daef17dae0bcc99178d49b52096b88a9d49b8c4

$ python3 -m seqspec.seqspec_check specs/parse-wt-v2/wt-mega-v2.yaml
$

sbooeshaghi commented 1 year ago

Can you run the spec against the head of the IGVF seqspec main branch?

detrout commented 1 year ago

Oops. I was calling the validator incorrectly.

I rebased against c52b53002a81fbcfcd41b03c3c8cf35218a69741 and made some fixes.

One question, though: what's the difference between N and X in the sequence string?

sbooeshaghi commented 1 year ago

Looks great! I merged it. Could you add the R1.fastq.gz when you get a chance?

dbrg77 commented 1 year ago

Hi both,

Thanks @detrout for providing the specs and whitelist.

I've been trying to figure out the details of ParseBio recently, and just realised that there might be some problems with the current spec for ParseBio.

As far as I know, the ParseBio structure should be:

[10-bp UMI][8-bp Round3 barcode]GTGGCCGATGTTTCGCATCGGCGTACGACT[8-bp Round2 barcode]ATCCACGTGCTTGAGACTGTGG[8-bp Round1 barcode](dT)

See this thread. The above structure seems to be correct based on real data on GEO. In general, the current spec is okay, but the first onlist should be barcode-23_onlist.txt, the second onlist should be barcode-23_onlist.txt as well, and the third onlist should be barcode-1_onlist_v2.txt. In the current version, the order of the onlists is reversed.
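
To make the proposed layout concrete, here is a minimal Python sketch that slices a barcode read into its parts using fixed offsets. It assumes the structure written above is right, which is exactly what is in question here. The names `LAYOUT`, `split_read` and `linkers_match` are made up for illustration and are not part of seqspec.

```python
# Segment lengths taken from the structure proposed in this comment.
UMI_LEN = 10
BC_LEN = 8
LINKER_3_2 = "GTGGCCGATGTTTCGCATCGGCGTACGACT"  # between Round3 and Round2 barcodes
LINKER_2_1 = "ATCCACGTGCTTGAGACTGTGG"          # between Round2 and Round1 barcodes

# Ordered (name, length) pairs describing the read from its 5' end.
LAYOUT = [
    ("umi", UMI_LEN),
    ("bc3", BC_LEN),
    ("linker_3_2", len(LINKER_3_2)),
    ("bc2", BC_LEN),
    ("linker_2_1", len(LINKER_2_1)),
    ("bc1", BC_LEN),
]
TOTAL_LEN = sum(length for _, length in LAYOUT)


def split_read(seq):
    """Slice a read into named segments by fixed position (no alignment)."""
    if len(seq) < TOTAL_LEN:
        raise ValueError(f"read has {len(seq)} bases, need at least {TOTAL_LEN}")
    parts = {}
    pos = 0
    for name, length in LAYOUT:
        parts[name] = seq[pos:pos + length]
        pos += length
    return parts


def linkers_match(parts):
    """True when both linker segments equal the expected sequences exactly."""
    return parts["linker_3_2"] == LINKER_3_2 and parts["linker_2_1"] == LINKER_2_1
```

Counting how many reads pass `linkers_match` on a real run is one quick way to test whether this layout fits the data. Exact matching is strict, so sequencing errors in the linkers will lower that count somewhat.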

I also have a problem with the sequences in barcode-1_onlist_v2.txt. They do not match real data, for example SRR13948565.
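
One way to quantify the mismatch is to measure what fraction of reads carry a barcode found in the onlist. The sketch below is a rough illustration only: `load_onlist` and `onlist_hit_rate` are hypothetical helper names, the onlist is assumed to be a plain text file with one barcode per line, and only exact matches are counted.

```python
def load_onlist(path):
    """Read an onlist file (one barcode per line) into a set."""
    with open(path) as fh:
        return {line.strip() for line in fh if line.strip()}


def onlist_hit_rate(reads, onlist, start, length):
    """Fraction of reads whose substring [start, start + length) is in the onlist.

    `reads` is any iterable of sequence strings. Returns 0.0 for empty input.
    """
    total = 0
    hits = 0
    for seq in reads:
        total += 1
        if seq[start:start + length] in onlist:
            hits += 1
    return hits / total if total else 0.0
```

Under the layout proposed above, the Round1 barcode would start at offset 78 (10 + 8 + 30 + 8 + 22), so the call would be `onlist_hit_rate(reads, onlist, start=78, length=8)`. A rate near zero on real data would support the concern that the list and the data disagree.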

@detrout: May I ask where the onlist sequences come from? Does ParseBio provide them? The kit is not available in China, so I don't know where to get them.

Finally, sorry if this is not the best place to discuss this. We could open a new discussion if needed.

Xi

siggoe commented 7 months ago

Hello,

I would be interested in figuring out the linker sequence for the first barcoding step, as I'd like to add gene-specific primers in the RT step. Did anyone figure out or confirm whether the sequences are correct?

Thanks a lot!