mingwhy / bioinfo_homemade_tools

Globus data downloading and 10x single cell data preprocessing #3

Open mingwhy opened 1 year ago

mingwhy commented 1 year ago

10x single cell data preprocessing

Open the Globus web page, click ENDPOINTS, then click 'Create a personal endpoint'. It will ask whether you'd like to download a Globus installation dmg file.

Use the Globus 'Preference' panel to delete the previous Globus setup.

Then re-install. It will ask you to confirm your email and user name.

After installation, open the web page again. Your local endpoint should appear on the file transfer page and on your 'Bookmarks' -> 'Your Connections' page.

Use Globus to download data.

You need to install the Globus client and delete the previous 'connection'. Then create a new connection on Globus and sync/transfer the data.

The download contains many files; everything you need is under the path: /Users/mingyang/Downloads/Promislow_Lab/fly_DP_10brains_done/outs/fastq_path/HFWWGDRXY/

We have 4 samples, and each sample was run on two lanes. There are 4 files (index1, index2, R1, R2) for each sample per lane. What you need are the R1 and R2 FASTQ files, e.g. F1_S2_L001_R1_001.fastq.gz and F1_S2_L001_R2_001.fastq.gz.
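Since only the R1/R2 pairs matter, a quick check that every R1 file has its R2 mate can catch an incomplete transfer. This is just a sketch: `check_pairs` is a hypothetical helper name, and it assumes all FASTQ files sit directly in the directory you pass in.

```shell
# check_pairs DIR: hypothetical helper, not part of Globus or any 10x tool.
# Lists every *_R1_001.fastq.gz in DIR and reports whether the matching
# *_R2_001.fastq.gz exists. Returns non-zero if any mate is missing.
check_pairs() {
  dir=$1
  missing=0
  for r1 in "$dir"/*_R1_001.fastq.gz; do
    [ -e "$r1" ] || continue            # the glob matched nothing
    r2="${r1%_R1_001.fastq.gz}_R2_001.fastq.gz"
    if [ -e "$r2" ]; then
      printf 'ok       %s\n' "$(basename "$r1")"
    else
      printf 'no mate  %s\n' "$(basename "$r1")"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}
```

Usage would be something like `check_pairs /Users/mingyang/Downloads/Promislow_Lab/fly_DP_10brains_done/outs/fastq_path/HFWWGDRXY`.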

Tip: Your FASTQ files must follow the Illumina naming convention, ex. SampleName_S1_L001_R1_001.fastq.gz.

For example: F1_S2_L001_I1_001.fastq.gz
F1: the sample name.
S2: the sample number, based on the order in which samples are listed in the sample sheet.
L001: the lane number.
R1: the read. In this example, R1 means Read 1. For a paired-end run, there is at least one file with R2 in the file name for Read 2. When generated, index reads are I1 or I2.
001: the last segment is always 001.
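The naming convention above can be checked mechanically. Below is a minimal sketch that splits a file name into those fields; `parse_fastq_name` is a made-up helper, and the pattern only encodes the fields described above (sample name, S number, three-digit lane, R1/R2/I1/I2, trailing 001).

```shell
# parse_fastq_name FILE: hypothetical helper that splits an Illumina-style
# FASTQ file name into its fields, or fails if the name does not match.
parse_fastq_name() {
  name=$(basename "$1")
  fields=$(printf '%s\n' "$name" | sed -nE \
    's/^(.+)_S([0-9]+)_L([0-9]{3})_([RI][12])_001\.fastq\.gz$/sample=\1 sample_number=\2 lane=\3 read=\4/p')
  if [ -n "$fields" ]; then
    printf '%s\n' "$fields"
  else
    printf 'not an Illumina-style FASTQ name: %s\n' "$name" >&2
    return 1
  fi
}

parse_fastq_name F1_S2_L001_I1_001.fastq.gz
# prints: sample=F1 sample_number=2 lane=001 read=I1
```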

Our purchased kit: Chromium Next GEM Single Cell 3’ GEM, Library & Gel Bead Kit v3.1, 4 rxns PN-1000128
https://support.10xgenomics.com/single-cell-gene-expression/library-prep/doc/user-guide-chromium-single-cell-3-reagent-kits-user-guide-v31-chemistry

Our dual index kit: Dual Index Kit TT Set A (PN-1000215)
https://kb.10xgenomics.com/hc/en-us/articles/360036953011-Where-can-I-find-the-Dual-Index-Kit-TT-Set-A-PN-1000215-sample-index-sequences-

Sign up for, or sign in to, the 10x cloud computing platform.

https://www.10xgenomics.com/products/cloud-analysis https://support.10xgenomics.com/cloud-analysis/billing

Create a project on 10x cloud and upload the FASTQ data.

Once you upload data, 10x cloud stores it for free for the first 90 days, then charges $0.02 per GB per month.
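To get a feel for what storage costs after the free period, here is a rough estimate. It is a simplification: it treats the 90 free days as 3 whole months and bills every later month in full, which may not match how 10x actually prorates. `storage_cost` is a hypothetical helper, and the 100 GB / 12 months figures are made up for illustration.

```shell
# storage_cost GB MONTHS: rough cost in USD of keeping GB gigabytes on
# 10x cloud for MONTHS months, assuming 3 free months and then
# $0.02 per GB per month (a simplification of the real billing).
storage_cost() {
  awk -v gb="$1" -v months="$2" 'BEGIN {
    billed = months - 3
    if (billed < 0) billed = 0
    printf "%.2f\n", gb * billed * 0.02
  }'
}

storage_cost 100 12   # 9 billed months x 100 GB x $0.02, prints 18.00
```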

!!No read quality processing has been done at this point!! Following https://support.10xgenomics.com/cloud-analysis/uploading-fastqs#download, I installed 'the 10x Genomics Cloud CLI', then uploaded the FASTQ data.

I installed it in my home folder: /Users/mingyang/txg-macos-v1.1.1/

curl -f -o txg-macos-v1.1.1.zip "https://cf.10xgenomics.com/cloud-cli/1.1.1/txg-macos-v1.1.1.zip"
unzip txg-macos-v1.1.1.zip
rm txg-macos-v1.1.1.zip
cd txg-macos-v1.1.1/
./txg fastqs upload --project-id 6nwIDDvISmWAaxskXdrT0A /Users/mingyang/Downloads/Promislow_Lab/fly_DP_10brains_done/outs/fastq_path/HFWWGDRXY

Enter the token from the website, and the data upload starts right away.

I also installed the txg tools on the server:

$ curl -f -o txg-linux-v1.1.1.tar.gz https://cf.10xgenomics.com/cloud-cli/1.1.1/txg-linux-v1.1.1.tar.gz
$ tar -zxvf txg-linux-v1.1.1.tar.gz
$ cd txg-linux-v1.1.1/
$ ./txg help

On the server it lives at: /gscratch/csde-promislow/mingy16/txg-linux-v1.1.1/

Start analyzing data on 10x Cloud (free data storage period: 90 days)

Select the library type. We used SI-TT-A5 ~ SI-TT-A8; I emailed 10x technical support, and our case is the 'ST' library type, which stands for 'standard'. https://support.10xgenomics.com/cloud-analysis/supported-products

Create your own transcriptome reference

When you start an analysis, you need a transcriptome reference. 10x doesn't provide a fly genome reference, so you need to upload one yourself.

reference upload guidelines: https://cloud.10xgenomics.com/cloud-analysis/custom-references https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references

Check out the build_fly_ref.txt file.

upload your own transcriptome reference

Upload following https://support.10xgenomics.com/cloud-analysis/custom-references#upload

$ tar -czvf BDGP6.32.tar.gz BDGP6.32/
$ cd ../txg-linux-v1.1.1/
$ ./txg references upload ../build_fly_ref/BDGP6.32.tar.gz 

You need to enter your token, which you can find in your Account Settings (https://cloud.10xgenomics.com/account/security). For me, it's 0a20f71b186d02f6ae0e023c27f4d75b8181e634d232c5c184e52197a6f72b77

The txg tool verifies that all required files are present in this reference and then uploads them, which takes ~2 minutes.

Verifying all required files are present...

All required files are present.

You are about to upload 1 custom reference to your 10x
Genomics Cloud account.

Custom reference name: "BDGP6.32"
Reference type: GEX
Cell Ranger mkref version: cellranger-6.1.1

Contents:
...

Create an analysis on 10x cloud

Click on the read files and 'create a new analysis'.

Create a Single Cell Gene Expression Analysis Running this analysis will align reads to the reference, perform cell detection, generate the gene-barcode matrix, identify cells clusters and perform differential expression analysis. This is equivalent to running the 10x Genomics pipeline Cell Ranger count v6.1.1.

If you've already uploaded your own reference, it will show up under 'Transcriptome reference'.

Then just start the analysis. After it is launched, you will see a message like this:

We’ll email you when the analysis is complete. This could take several hours depending on the size of your data. Feel free to close this window or start another analysis in the meantime. Made a mistake? Cancel analysis

mingwhy commented 1 year ago

build a fly reference genome for 10x reads mapping

Download and Extract Cell Ranger

https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_in#download
I downloaded and installed it on the hyak server.

 # after logging in to hyak, request compute resources
 $ srun -p build --time=6:00:00 --mem=200G --pty /bin/bash

 $ pwd
 # /gscratch/csde-promislow/mingy16/build_fly_ref

 $ curl -o cellranger-6.1.1.tar.gz "https://cf.10xgenomics.com/releases/cell-exp/cellranger-6.1.1.tar.gz?Expires=1630567777&Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cHM6Ly9jZi4xMHhnZW5vbWljcy5jb20vcmVsZWFzZXMvY2VsbC1leHAvY2VsbHJhbmdlci02LjEuMS50YXIuZ3oiLCJDb25kaXRpb24iOnsiRGF0ZUxlc3NUaGFuIjp7IkFXUzpFcG9jaFRpbWUiOjE2MzA1Njc3Nzd9fX1dfQ__&Signature=Ta4L6k9JMVaQMbsbl07sYeqFlijZXArBRvGT2Q3V2Z4Fg9gT69TpSIvPqUFJ4mybjjXnL-HjIyAXGfjDfG11a8BQs5FOlJOpm3q6VtJpwkztKaNBhcKSLTXJyuhb5ZTaHb1DsQmL8d0u0hPU0Vs6TAxjgMqAcvtArvslFRgk2laN3V7FdLy20HeaxPdhTtnAsTW4WSt4C7r8LHV3mKJytMjFgN2IxPStnEplHCYuXNhzkwm00E61uLZvJ6fch1E4L2DtSWYZjstnfNqH6Ke4a1z4xYpv7rBXXxGmf8jDcWOFe~2UoZIsmlZKJPE1ncRnFRwnwRNVyiOURCjytQztmQ__&Key-Pair-Id=APKAI7S6A5RYOXBWRPDA"

 tar -zxvf cellranger-6.1.1.tar.gz 
 cd cellranger-6.1.1/

 # add cellranger to your PATH
 vim ~/.bashrc 
 # export PATH=/gscratch/csde-promislow/mingy16/build_fly_ref/cellranger-6.1.1:$PATH
 source ~/.bashrc 
 which cellranger 
 #/gscratch/csde-promislow/mingy16/build_fly_ref/cellranger-6.1.1/cellranger

download fly GTF and genome FASTA files

online tutorial: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_mr

I mainly followed this one: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_mr#macaque_6.0.0

Look at the README at 'http://ftp.ensembl.org/pub/release-104/fasta/drosophila_melanogaster/dna/'. I downloaded 'Drosophila_melanogaster.BDGP6.32.dna_rm.toplevel.fa.gz'.

$ curl -O http://ftp.ensembl.org/pub/release-104/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.32.104.chr.gtf.gz
$ wget http://ftp.ensembl.org/pub/release-104/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.32.dna_rm.toplevel.fa.gz
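Before unpacking, it may be worth confirming that both downloads arrived intact, since a truncated .gz will only fail later inside cellranger. A small sketch using `gzip -t` (which tests an archive without extracting it); `verify_gz` is a hypothetical helper name.

```shell
# verify_gz FILE...: hypothetical helper that runs `gzip -t` on each file
# and returns non-zero if any of them fails the integrity test.
verify_gz() {
  status=0
  for f in "$@"; do
    if gzip -t "$f" 2>/dev/null; then
      printf 'ok       %s\n' "$f"
    else
      printf 'corrupt  %s\n' "$f"
      status=1
    fi
  done
  return "$status"
}
```

Usage would be `verify_gz Drosophila_melanogaster.BDGP6.32.104.chr.gtf.gz Drosophila_melanogaster.BDGP6.32.dna_rm.toplevel.fa.gz`.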

Filter the GTF

https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_mr#filter

$ gzip -cd Drosophila_melanogaster.BDGP6.32.104.chr.gtf.gz > Drosophila_melanogaster.BDGP6.32.104.chr.gtf
$ cellranger mkgtf \
  Drosophila_melanogaster.BDGP6.32.104.chr.gtf \
  Drosophila_melanogaster.BDGP6.32.104.chr.filtered.gtf \
  --attribute=gene_biotype:protein_coding \
  --attribute=gene_biotype:lincRNA \
  --attribute=gene_biotype:antisense \
  --attribute=gene_biotype:IG_LV_gene \
  --attribute=gene_biotype:IG_V_gene \
  --attribute=gene_biotype:IG_V_pseudogene \
  --attribute=gene_biotype:IG_D_gene \
  --attribute=gene_biotype:IG_J_gene \
  --attribute=gene_biotype:IG_J_pseudogene \
  --attribute=gene_biotype:IG_C_gene \
  --attribute=gene_biotype:IG_C_pseudogene \
  --attribute=gene_biotype:TR_V_gene \
  --attribute=gene_biotype:TR_V_pseudogene \
  --attribute=gene_biotype:TR_D_gene \
  --attribute=gene_biotype:TR_J_gene \
  --attribute=gene_biotype:TR_J_pseudogene \
  --attribute=gene_biotype:TR_C_gene
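The biotype list above is copied from the 10x tutorial, so it can help to see which gene_biotype values the fly GTF actually contains and how many genes each one covers. The sketch below tabulates them from the `gene` rows of a GTF; `count_biotypes` is a hypothetical helper, and it assumes the Ensembl attribute format `gene_biotype "value";`.

```shell
# count_biotypes GTF: hypothetical helper that prints each gene_biotype
# found on "gene" rows of an uncompressed GTF, with its gene count.
count_biotypes() {
  awk -F '\t' '$3 == "gene" {
    if (match($9, /gene_biotype "[^"]+"/)) {
      # skip the 14-char prefix (gene_biotype, space, quote) and the closing quote
      counts[substr($9, RSTART + 14, RLENGTH - 15)]++
    }
  }
  END { for (b in counts) printf "%s\t%d\n", b, counts[b] }' "$1" | sort
}
```

Running `count_biotypes Drosophila_melanogaster.BDGP6.32.104.chr.gtf` before and after `cellranger mkgtf` shows which biotypes the filter kept.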

Run cellranger mkref

$ gzip -cd Drosophila_melanogaster.BDGP6.32.dna_rm.toplevel.fa.gz >Drosophila_melanogaster.BDGP6.32.dna_rm.toplevel.fa

$ cellranger mkref --genome=BDGP6.32 --fasta=Drosophila_melanogaster.BDGP6.32.dna_rm.toplevel.fa --genes=Drosophila_melanogaster.BDGP6.32.104.chr.filtered.gtf

A folder named 'BDGP6.32' is generated, and it's done~
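Before tarring the folder up for upload, a quick sanity check of its contents can save a failed upload. This sketch assumes the usual mkref output layout (a reference.json file plus fasta/, genes/ and star/ subfolders); that layout is my assumption from memory, not something checked against this run, and `check_ref` is a hypothetical helper.

```shell
# check_ref DIR: hypothetical helper that checks a mkref output folder for
# the pieces assumed above: reference.json plus fasta/, genes/ and star/.
check_ref() {
  ref=$1
  bad=0
  [ -f "$ref/reference.json" ] || { printf 'missing file: reference.json\n'; bad=1; }
  for sub in fasta genes star; do
    [ -d "$ref/$sub" ] || { printf 'missing directory: %s/\n' "$sub"; bad=1; }
  done
  return "$bad"
}
```

For example: `check_ref BDGP6.32 && tar -czvf BDGP6.32.tar.gz BDGP6.32/`.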